Abstract 

In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting 
the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and 
post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local 
learning rules is obtained by first specifying the nature of the local variables, and then the functional form that ties 
them together into each learning rule. Such a framework also enables the systematic discovery of new learning 
rules and the exploration of relationships between learning rules and group symmetries. We study polynomial local 
learning rules stratified by their degree and analyze their behavior and capabilities in both linear and non-linear 
units and networks. Stacking local learning rules in deep feedforward networks leads to deep local learning. 

While deep local learning can learn interesting representations, it cannot learn complex input-output functions, 
even when targets are available for the top layer. Learning complex input-output functions requires local deep 
learning where target information is communicated to the deep layers through a backward learning channel. The 
nature of the communicated information about the targets and the structure of the learning channel partition the 
space of learning algorithms. For any learning algorithm, the capacity of the learning channel can be defined as 
the number of bits provided about the error gradient per weight, divided by the number of required operations 
per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation 
outperforms them by simultaneously maximizing the information rate and minimizing the computational cost. This 
result is also shown to be true for recurrent networks, by unfolding them in time. The theory clarifies the concept 
of Hebbian learning, establishes the power and limitations of local learning rules, introduces the learning channel 
which enables a formal analysis of the optimality of backpropagation, and explains the sparsity of the space of 
learning rules discovered so far. 

Keywords: machine learning; neural networks; deep learning; backpropagation; learning rules; Hebbian learning; 
learning channel; recurrent networks; recursive networks; supervised learning; unsupervised learning. 


1 Introduction 


The deep learning problem can be viewed as the problem of learning the connection weights of large computational 
graphs, in particular the weights of the deep connections that are far away from the inputs or outputs [?]. In spite 
of decades of research, only very few algorithms have been proposed to address this task. Among the most 
important ones, and somewhat in opposition to each other, are backpropagation [50] and Hebbian learning [26]. 
Backpropagation has been the dominant algorithm, at least in terms of successful applications, which have ranged 
over the years from computer vision [?] to high-energy physics [?]. In spite of many attempts, no better algorithm 
has been found, at least within the standard supervised learning framework. In contrast to backpropagation, which 
is a well-defined algorithm (stochastic gradient descent), Hebbian learning has remained a more nebulous concept, 
often associated with notions of biological and unsupervised learning. While less successful than backpropagation in 
applications, it has periodically inspired the development of theories aimed at capturing the essence of neural learning 
[?, 22, 29]. Within this general context, the goal of this work is to create a precise framework to organize and study 
the space of learning rules and their properties and address several questions, in particular: (1) What is Hebbian 
learning? (2) What are the capabilities and limitations of Hebbian learning? (3) What are the connections between 
Hebbian learning and backpropagation? (4) Are there other learning algorithms better than backpropagation? These 
questions are addressed in two parts: the first part focuses on Hebbian learning, the second part on backpropagation. 


1.1 The Deep Learning Problem 


At the core of many neural system models is the idea that information is stored in synapses and typically represented 
by a “synaptic weight”. While synapses could conceivably be far more complex (e.g. [?]) and require multiple 
variables for describing their states, for simplicity here we will use the single synaptic weight framework, although 
the same ideas can readily be extended to more complex cases. In this framework, synapses are faced with the task of 
adjusting their individual weights in order to store relevant information and collectively organize in order to sustain 
neural activity leading to appropriately adapted behavior at the level of the organism. This is a daunting task if one 
thinks about the scale of synapses and how remote they can be from sensory inputs and motor outputs. Suffice it to 
say that when rescaled by a factor of $10^6$, a synapse is the size of a fist, and the violin bow or tennis racket 
it ought to help control is 1,000 miles away. This is the core of the deep learning problem. 


1.2 The Hebbian Learning Problem 

Donald Hebb is credited with being among the first to think about this problem and attempt to come up with a 
plausible solution in his 1949 book The Organization of Behavior [?]. However, Hebb was primarily a psychologist 
and his ideas were stated in rather vague terms, such as: “When an axon of cell A is near enough to excite cell B and 
repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both 
cells such that A’s efficiency, as one of the cells firing B, is increased”, often paraphrased as “Neurons that fire together 
wire together”. Not a single equation can be found in his book. 

While the concept of Hebbian learning has played an important role in the development of both neuroscience 
and machine learning, its lack of crispness becomes obvious as soon as one raises simple questions like: Is the 
backpropagation learning rule Hebbian? Is Oja’s learning rule [44] Hebbian? Is a rule that depends on a function 
of the output [?] Hebbian? Is a learning rule that depends only on the input Hebbian? And so forth. This lack of 
crispness is more than a simple semantic issue. While it may have helped the field in its early stages, in the same 
way that vague concepts like “gene” or “consciousness” may have helped molecular biology or neuroscience, it has 
also prevented clear thinking about basic questions regarding, for instance, the behavior of linear networks under 
Hebbian learning, or the capabilities and limitations of Hebbian learning in both shallow and deep networks. 

At the same time, there have been several attempts at putting the concept of Hebbian learning at the center of 
biological learning [22, 29]. Hopfield proposed to use Hebbian learning to store memories in networks of symmetrically 
connected threshold gates. While the resulting model is elegant and amenable to interesting analyses, it oversimplifies 
the problem by considering only shallow networks, where all the units are visible and have targets. Fukushima 
proposed the neocognitron architecture for computer vision, inspired by the earlier neurophysiological work of Hubel and 
Wiesel [?], essentially in the form of a multi-layer convolutional neural network. Most importantly for the present 
work, Fukushima proposed to learn the parameters of the neocognitron architecture in a self-organized way using 
some kind of Hebbian mechanism. While the Fukushima program has remained a source of inspiration for several 
decades, a key result of this paper is to show that such a program cannot succeed at finding an optimal set of weights 
in a feedforward architecture, regardless of which specific form of Hebbian learning is being used. 


1.3 The Space of Learning Rules and its Sparsity Problem 

Partly related to the nebulous nature of Hebbian learning is the observation that so far the entire machine learning 
field has been able to come up with only very few learning rules, like the backpropagation rule and Hebb’s rule. Other 
familiar rules, such as the perceptron learning rule [?], the delta learning rule [55], and Oja’s rule [44], can be viewed 
as special cases of, or variations on, backpropagation or Hebb (Table 1). Additional variations are found also in, for 
instance, [13, 32, 37], and discussions of learning rules from a general standpoint in [?, 34]. This creates a potentially 
unsatisfactory situation given that of the two most important learning algorithms, the first one could have been derived 
by Newton or Leibniz, and the second one is shrouded in vagueness. Furthermore, this raises the broader question of 
the nature of the space of learning rules. In particular, why does the space seem so sparse? Are there new rules that 
remain to be discovered in this space? 
Learning Rule       Expression

Simple Hebb         $\Delta w_{ij} \propto O_i O_j$
Oja                 $\Delta w_{ij} \propto O_i O_j - O_i^2 w_{ij}$
Perceptron          $\Delta w_{ij} \propto (T - O_i) O_j$
Delta               $\Delta w_{ij} \propto (T - O_i) f'(S_i) O_j$
Backpropagation     $\Delta w_{ij} \propto B_i O_j$

Table 1: Common learning rules and their on-line expressions. $O_i$ represents the activity of the postsynaptic neuron, 
$O_j$ the activity of the presynaptic neuron, and $w_{ij}$ the synaptic strength of the corresponding connection. $B_i$ represents 
the back-propagated error in the postsynaptic neuron. The perceptron and Delta learning rules were originally defined 
for a single unit (or single layer), in which case $T$ is the readily available output target. 
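
The rules of Table 1 can each be written as a small function of the variables available at one synapse. The following is a minimal sketch in Python; the function names and the explicit learning rate `eta` (standing in for the proportionality constant) are illustrative assumptions, not notation taken from the text.

```python
def simple_hebb(o_post, o_pre, eta=1.0):
    """Simple Hebb: product of post- and presynaptic activities."""
    return eta * o_post * o_pre


def oja(o_post, o_pre, w, eta=1.0):
    """Oja: Hebbian term minus a decay term scaled by the squared output."""
    return eta * (o_post * o_pre - o_post ** 2 * w)


def perceptron(target, o_post, o_pre, eta=1.0):
    """Perceptron: output error times presynaptic activity."""
    return eta * (target - o_post) * o_pre


def delta(target, o_post, o_pre, f_prime_s, eta=1.0):
    """Delta: output error times the slope f'(S_i) times presynaptic activity."""
    return eta * (target - o_post) * f_prime_s * o_pre


def backprop(b_post, o_pre, eta=1.0):
    """Backpropagation: backpropagated error B_i times presynaptic activity."""
    return eta * b_post * o_pre
```

Written this way, the only difference between the rules is which quantities are assumed to be available at the synapse: the weight itself for Oja, a target for the perceptron and delta rules, and a backpropagated error for backpropagation.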

2 A Framework for Local Learning Rules 

The origin of the vagueness of the Hebbian learning idea is that it indiscriminately mixes two fundamental but distinct 
ideas: (1) learning ought to depend on local information associated with the pre- and post-synaptic neurons; and (2) 
learning ought to depend on the correlation between the activities of these neurons, yielding a spectrum of possibilities 
on how these correlations are counted and used to change the synaptic weights. The concept of local learning rule, 
mentioned but not exploited in [?], is more fundamental than the concept of Hebbian learning rule, as it explicitly 
exposes the more general notion of locality, which is implicit but somehow hidden in the vagueness of the Hebbian 
concept. 

2.1 The Concept of Locality 

To address all the above issues, the first observation is that in a physical implementation a learning rule to adjust 
a synaptic weight can only include local variables. Thus to bring clarity to the computational models, one must 
first define which variables are to be considered local in a given model. Consider the backpropagation learning 
rule $\Delta w_{ij} \propto B_i O_j$, where $B_i$ is the postsynaptic backpropagated error and $O_j$ is the presynaptic activity. If the 
backpropagated error is not considered a local variable, then backpropagation is not a local learning rule, and thus 
is not Hebbian. If the backpropagated error is considered a local variable, then backpropagation may be Hebbian, 
both in the sense of being local and of being a simple product of local pre- and post-synaptic terms. [Note that, 
even if considered local, the backpropagated error may or may not be of the same nature (e.g. firing rate) as the 
presynaptic term, and this may invalidate its Hebbian character depending, again, on how one interprets the vague 
Hebbian concept.] 

Once one has decided which variables are to be considered local in a given model, then one can generally express 
a learning rule as 


$$\Delta w_{ij} = F(\text{local variables}) \tag{1}$$

for some function F. A systematic study of local learning requires a systematic analysis of many cases in terms of 
not only the functions F, but also in terms of the computing units and their transfer functions (e.g. linear, sigmoidal, 
threshold gates, rectified linear, stochastic, spiking), the network topologies (e.g. shallow/deep, autoencoders, 
feedforward/recurrent), and other possible parameters (e.g. on-line vs batch). Here for simplicity we first consider single 
processing units with input-output functions of the form 


$$O = f(S) = f\Big(\sum_j w_j I_j\Big) \tag{2}$$

where $I$ is the input vector and the transfer function $f$ is the identity in the linear case, or the [0,1] logistic function 
$\sigma_{[0,1]}(x) = 1/(1 + e^{-x})$, or the [-1,1] hyperbolic tangent function $\sigma_{[-1,1]}(x) = (1 - e^{-2x})/(1 + e^{-2x})$, or the 
corresponding threshold functions $\tau_{[0,1]}$ and $\tau_{[-1,1]}$. When necessary, the bias is included in this framework by considering 
that the input value $I_0$ is always set to 1 and the bias is provided by the corresponding weight $w_0$. In the case of a network 
of $N$ such units, we write 

$$O_i = f(S_i) = f\Big(\sum_j w_{ij} O_j\Big) \tag{3}$$

where in general we assume that there are no self-connections ($w_{ii} = 0$). In general, the computing units can be 
subdivided into three subsets corresponding to input units, output units, and hidden units. While this formalism 
includes both feedforward and recurrent networks, in the first part of the paper we will focus primarily on feedforward 
networks. However issues of feedback and recurrent networks will become important in the second part. 
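
As an illustration of the single-unit formalism above, here is a minimal sketch in Python of a unit that computes a weighted sum of its inputs and passes it through a transfer function. The function names, the default identity transfer function, and the convention chosen for the threshold at zero are assumptions made for this example.

```python
import math


def logistic(x):
    """[0,1] logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_unit(x):
    """[-1,1] hyperbolic tangent written as (1 - e^{-2x}) / (1 + e^{-2x})."""
    e = math.exp(-2.0 * x)
    return (1.0 - e) / (1.0 + e)


def threshold01(x):
    """[0,1] threshold gate (the value at x == 0 is a convention chosen here)."""
    return 1.0 if x > 0 else 0.0


def unit_output(weights, inputs, transfer=lambda s: s, bias=None):
    """Output O = f(sum_j w_j I_j).

    A bias is handled as an extra weight attached to a constant input of 1.
    """
    if bias is not None:
        weights = [bias] + list(weights)
        inputs = [1.0] + list(inputs)
    s = sum(w * x for w, x in zip(weights, inputs))
    return transfer(s)
```

For example, `unit_output([1, 2], [3, 4])` returns the linear activation 11, and passing `transfer=logistic` turns the same unit into a [0,1] sigmoidal unit.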

Within this general formalism, we typically consider first that the local variables are the presynaptic activity, the 
postsynaptic activity, and $w_{ij}$, so that 

$$\Delta w_{ij} = F(O_i, O_j, w_{ij}) \tag{4}$$

In supervised learning, in a model where the target $T_i$ is considered a local variable, the rule can have the more general 
form 

$$\Delta w_{ij} = F(T_i, O_i, O_j, w_{ij}) \tag{5}$$

For instance, we will consider cases where the output is clamped to the value $T_i$, or where the error signal $T_i - O_i$ 
is a component of the learning rule. The latter is the case in the perceptron learning algorithm, or in the deep targets 
algorithm described below, with backpropagation as a special case. Equation 5 represents a local learning rule if one 
assumes that there is a target $T_i$ that is locally available. Targets can be clearly available and local for the output layer. 
However, the generation and local availability of targets for deep layers is a fundamental, but separate, question that 
will be addressed in later sections. Thus it is essential to note that the concept of locality is orthogonal to the concept 
of unsupervised learning. An unsupervised learning rule can be non-local if $F$ depends on activities or synaptic 
weights that are far away in the network. Likewise, a supervised learning rule can be local if the target is assumed to 
be a local variable. Finally, we also assume that the learning rate $\eta$ is a local variable contained in the function $F$. For 
simplicity, we will assume that the value of $\eta$ is shared by all the units, although more general models are possible. 

In short, it is time to move away from the vagueness of the term “Hebbian learning” and replace it with a clear 
definition, in each situation, of: (1) which variables are to be considered local; and (2) which functional form is used 
to combine the local variables into a local learning rule. A key goal is then to systematically study the properties of 
different local rules across different network types. 

2.1.1 Spiking versus Non-Spiking Neurons 

The concept of locality (Equation 1) is completely general and applies equally well to networks of spiking neurons 
and non-spiking neurons. The analyses of specific local learning rules in Sections 3-5 are conducted for non-spiking 
neurons, but some extensions to spiking neurons are possible (see, for instance, [?]). Most of the material in Sections 
6-8 is general again and is applicable to networks of spiking units. The main reason is that these sections are concerned 
primarily with the propagation of information about the targets from the output layer back to the deeper layers, 
regardless of how this information is encoded, and regardless of whether non-spiking or spiking neurons are used in 
the forward, or backward, directions. 

2.2 Coordinate Transformations and Symmetries 

This subsection is not essential to follow the rest of this paper and can initially be skipped. When studying local 
learning rules, it is important to look at the effects of coordinate transformations and various symmetries on the 
learning rules. While a complete treatment of these operations is beyond our scope, we give several specific examples 
below. In general, applying coordinate changes or symmetries can bring to light some important properties of a 
learning rule, and shows in general that the function F should not be considered too narrowly, but rather as a member 
of a class. 
2.2.1 Example 1: Range Transformation (Affine Transformation) 

For instance, consider the narrow definition of Hebb’s rule as $\Delta w_{ij} \propto O_i O_j$ applied to threshold gates with binary 
inputs. This definition makes some sense if the threshold gates are defined using a [-1,1] formalism, but is problematic 
over a [0,1] formalism because it results in $\Delta w_{ij} = O_i O_j$ being 0 in three out of four cases, and always positive 
and equal to 1 in the remaining fourth case. Thus the narrow definition of Hebb’s rule over a [0,1] system should 
be modified using the corresponding affine transformation. However, the new expression will have to be in the same 
functional class, i.e. in this case quadratic functions over the activities. The same considerations apply when sigmoid 
transfer functions are used. 

More specifically, [0,1] networks are transformed into [-1,1] networks through the transformation $x \to 2x - 1$, or 
vice versa through the transformation $x \to (x + 1)/2$. It is easy to show that a polynomial local rule in one type of 
network is transformed into a polynomial local rule of the same degree in the other type of network. For instance, a 
quadratic local rule with coefficients $\alpha_{[0,1]}, \beta_{[0,1]}, \gamma_{[0,1]}, \delta_{[0,1]}$ of the form 

$$\Delta w_{ij} \propto \alpha_{[0,1]} O_i O_j + \beta_{[0,1]} O_i + \gamma_{[0,1]} O_j + \delta_{[0,1]} \tag{6}$$

is transformed into a rule with coefficients $\alpha_{[-1,1]}, \beta_{[-1,1]}, \gamma_{[-1,1]}, \delta_{[-1,1]}$ through the homogeneous system: 

$$\alpha_{[-1,1]} = 4\alpha_{[0,1]} \tag{7}$$

$$\beta_{[-1,1]} = 2\beta_{[0,1]} - 2\alpha_{[0,1]} \tag{8}$$

$$\gamma_{[-1,1]} = 2\gamma_{[0,1]} - 2\alpha_{[0,1]} \tag{9}$$

$$\delta_{[-1,1]} = \delta_{[0,1]} + \alpha_{[0,1]} - \beta_{[0,1]} - \gamma_{[0,1]} \tag{10}$$

Note that no non-zero quadratic rule can have the same form in both systems, even when trivial multiplicative 
coefficients are absorbed into the learning rate. 
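
The coefficient relations of this example can be checked numerically. The sketch below assumes that they arise from substituting $x \to 2x - 1$ into each activity of the quadratic form and collecting the terms; the function names and the test coefficients are illustrative.

```python
def substitute_affine(alpha, beta, gamma, delta):
    """Coefficients of the quadratic form after the substitution O -> 2*O - 1.

    Expanding alpha*(2u-1)*(2v-1) + beta*(2u-1) + gamma*(2v-1) + delta
    and collecting the terms in u*v, u, v and 1 gives the four values below.
    """
    return (
        4 * alpha,
        2 * beta - 2 * alpha,
        2 * gamma - 2 * alpha,
        delta + alpha - beta - gamma,
    )


def quadratic_rule(coeffs, oi, oj):
    """Evaluate alpha*Oi*Oj + beta*Oi + gamma*Oj + delta."""
    alpha, beta, gamma, delta = coeffs
    return alpha * oi * oj + beta * oi + gamma * oj + delta


original = (1.0, 0.5, -2.0, 3.0)   # arbitrary illustrative coefficients
transformed = substitute_affine(*original)

# The two expressions agree at every corner once the variables are mapped
# by x -> 2x - 1.
max_gap = max(
    abs(quadratic_rule(original, 2 * u - 1, 2 * v - 1)
        - quadratic_rule(transformed, u, v))
    for u in (0, 1)
    for v in (0, 1)
)

# The pure product rule does not keep its form under the change of range.
hebb_transformed = substitute_affine(1, 0, 0, 0)
```

Here `hebb_transformed` contains non-zero linear and constant terms, which is a concrete instance of the remark that a quadratic rule does not keep the same form in both systems.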


2.2.2 Example 2: Permutation of Training Examples 

Learning rules may be more or less sensitive to permutations in the order in which examples are presented. In order to 
analyze the behavior of most rules, here we will assume that they are not sensitive to the order in which the examples 
are presented, which is generally the case if all the training examples are treated equally, and the on-line learning rate 
is small and changes slowly so that averages can be computed over entire epochs (see below). 

2.2.3 Example 3: Network Symmetries 

When the same learning rule is applied isotropically, it is important to examine its behavior under the symmetries 
of the network architecture to which it is applied. This is the case, for instance, in Hopfield networks where all the 
units are connected symmetrically to each other (see next section), or between fully connected layers of a feedforward 
architecture. In particular, it is important to examine whether differences in inputs or in weight initializations can lead 
to symmetry breaking. It is also possible to consider models where different neurons, or different connections, use 
different rules, or rules in the same class (like the quadratic rule of Example 1) but with different coefficients. 

2.2.4 Example 4: Hypercube Isometries 

As a fourth example [?], consider a Hopfield network [?] consisting of $N$ threshold gates, with $\pm 1$ outputs, connected 
symmetrically to each other ($w_{ij} = w_{ji}$). It is well known that such a system and its dynamics are characterized by 
the quadratic energy function $E = -(1/2) \sum_{i,j} w_{ij} O_i O_j$ (note that the linear terms of the quadratic energy function 
are taken into account by $O_0 = 1$). The quadratic function $E$ induces an acyclic orientation $\mathcal{O}$ of the $N$-dimensional 
hypercube $H^N = [-1,1]^N$, where the edge between two neighboring (i.e. at Hamming distance 1) states $x$ 
and $y$ is oriented from $x$ to $y$ if and only if $E(x) > E(y)$. Patterns or “memories” are stored in the weights of the 
system by applying the simple Hebb rule $\Delta w_{ij} \propto O_i O_j$ to the memories. Thus a given training set $S$ produces 
a corresponding set of weights, thus a corresponding energy function, and thus a corresponding acyclic orientation 
$\mathcal{O}(S)$ of the hypercube. Consider now an isometry $h$ of the $N$-dimensional hypercube, i.e. a one-to-one function from 
$H^N$ to $H^N$ that preserves the Hamming distance. It is easy to see that all isometries can be generated by composing 
two kinds of elementary operations: (1) permuting two components; and (2) inverting the sign of a component (hence 
the isometries are linear). It is then natural to ask what happens to $\mathcal{O}(S)$ when $h$ is applied to $H^N$ and thus to $S$. It 
can be shown that under the simple Hebb rule the “diagram commutes” (Figure 2.1). In other words, $h(S)$ is a new 
training set which leads to a new acyclic orientation $\mathcal{O}(h(S))$ and 

$$h(\mathcal{O}(S)) = \mathcal{O}(h(S)) \tag{11}$$

Thus the simple Hebb rule is invariant under the action of the isometries of the hypercube. In Appendix A, we show 
it is the only rule with this property. 


[Figure 2.1 diagram: the training sets $S$ and $h(S)$ lead to the energy functions $E_{w(S)}$ and $E_{w(h(S))}$, which induce the orientations $\mathcal{O}(S)$ and $\mathcal{O}(h(S))$; the isometry $h$ connects the two columns.]

Figure 2.1: Commutative diagram for the simple Hebb rule in a Hopfield network. Application of the simple Hebb 
rule to a set $S$ of binary vectors over the $[-1,1]^N$ hypercube in a Hopfield network with $N$ units yields a set of symmetric 
weights $w_{ij} = w_{ji}$ and a corresponding quadratic energy function $E_{w(S)}$, which ultimately produces a directed acyclic 
orientation of the hypercube $\mathcal{O}(S)$ directing the dynamics of the network towards minima of $E_{w(S)}$. An isometry $h$ 
over the hypercube yields a new set of vectors $h(S)$ hence, by application of the same Hebb rule, a new set of weights, 
a new energy function $E_{w(h(S))}$, and a new acyclic orientation such that $h(\mathcal{O}(S)) = \mathcal{O}(h(S))$. 
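
The commutation property can be checked by brute force on a small hypercube. The sketch below is an illustration under stated assumptions: $N = 4$, three arbitrary stored patterns, one arbitrary isometry built from a permutation and sign flips, and no bias term. It verifies that the energy of a state under the weights learned from $S$ equals the energy of the transported state under the weights learned from $h(S)$, which implies that every edge keeps its orientation.

```python
from itertools import product


def hebb_weights(patterns):
    """Simple Hebb rule summed over the stored +/-1 patterns, no self-connections."""
    n = len(patterns[0])
    w = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j]
    return w


def energy(w, x):
    """Quadratic energy E = -(1/2) * sum_ij w_ij x_i x_j."""
    n = len(x)
    return -0.5 * sum(w[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def isometry(x, perm, signs):
    """Hypercube isometry: permute the components, then flip the selected signs."""
    return tuple(signs[i] * x[perm[i]] for i in range(len(x)))


# Illustrative choices (assumptions): three stored patterns and one isometry h.
S = [(1, -1, 1, 1), (-1, -1, 1, -1), (1, 1, -1, 1)]
perm, signs = (2, 0, 3, 1), (1, -1, -1, 1)

w_S = hebb_weights(S)
w_hS = hebb_weights([isometry(p, perm, signs) for p in S])

# E_{w(S)}(x) equals E_{w(h(S))}(h(x)) for every state x of the hypercube.
energies_match = all(
    energy(w_S, x) == energy(w_hS, isometry(x, perm, signs))
    for x in product((-1, 1), repeat=4)
)
```

The check passes because the Hebbian weights transform as $w_{ij} \to \epsilon_i \epsilon_j w_{\pi(i)\pi(j)}$ under a permutation $\pi$ and sign flips $\epsilon$, and the sign factors cancel in the quadratic energy.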


2.3 Functional Forms 

Within the general assumption that $\Delta w_{ij} = F(O_i, O_j, w_{ij})$, or $\Delta w_{ij} = F(T_i, O_i, O_j, w_{ij})$ in the supervised case, one 
must consider next the functional form of $F$. Among other things, this allows one to organize and stratify the space of 
learning rules. As seen above, the function $F$ cannot be defined too narrowly, as it must be invariant to certain changes, 
and thus one is primarily interested in classes of functions. In this paper, we focus exclusively on the case where $F$ is 
a polynomial function of degree $n$ (e.g. linear, quadratic, cubic) in the local variables, although other functional forms 
could be considered, such as rational functions or power functions with rational exponents. Most rules that are found 
in the neural network literature correspond to low-degree polynomial rules. Thus we consider functions $F$ comprising 
a sum of terms of the form $\alpha O_i^{n_i} O_j^{n_j} w_{ij}^{n_{w_{ij}}}$ (or $\alpha T_i^{n_{T_i}} O_i^{n_i} O_j^{n_j} w_{ij}^{n_{w_{ij}}}$) where $\alpha$ is a real coefficient [in this paper we 
assume that the constant $\alpha$ is the same for all the weights, but many of the analyses carry over to the case where 
different weights have different coefficients, although such a system is not invariant under relabeling of the neurons]; 
$n_{T_i}$, $n_i$, $n_j$, and $n_{w_{ij}}$ are non-negative integers satisfying $n_{T_i} + n_i + n_j + n_{w_{ij}} \le n$. In this term, the apparent degree 
of $w$ is $n_{w_{ij}}$, but the effective degree of $w$ may be higher because $O_i$ depends also on $w_{ij}$, typically in a linear way at 
least around the current value of $O_i$. In this case, the effective degree of $w_{ij}$ in this term is $n_i + n_{w_{ij}}$. [For instance, 
consider a rule of the form $\Delta w_{ij} = w_{ij} O_i^2 I_j$, with a linear unit $O_i = \sum_k w_{ik} I_k$. The apparent degree of the rule in 
$w_{ij}$ is 1, but the effective degree is 3.] Finally, we let $d$ ($d \le n$) denote the highest effective degree of $w$, among all 
the terms in $F$. As we shall see, $n$ and $d$ are the two main numbers of interest used to stratify the polynomial learning 
rules. 
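
The bookkeeping of the two degrees can be made concrete with a few lines of code. The following sketch assumes that each term of a polynomial rule is described by its exponents on the target, the postsynaptic activity, the presynaptic activity, and the weight; the function names and the dictionary encoding are illustrative choices.

```python
def term_degrees(n_T=0, n_i=0, n_j=0, n_w=0):
    """Return (total degree, effective degree in w) of one term.

    The term is T^n_T * O_i^n_i * O_j^n_j * w_ij^n_w. The effective degree
    also counts the exponent of O_i, because O_i depends on w_ij
    (linearly, for a linear unit).
    """
    return n_T + n_i + n_j + n_w, n_i + n_w


def rule_degrees(terms):
    """Degrees (n, d) of a polynomial rule given as a list of exponent dictionaries."""
    degrees = [term_degrees(**t) for t in terms]
    return max(n for n, _ in degrees), max(d for _, d in degrees)


simple_hebb = [dict(n_i=1, n_j=1)]                   # O_i O_j
oja = [dict(n_i=1, n_j=1), dict(n_i=2, n_w=1)]       # O_i O_j - O_i^2 w_ij
example = [dict(n_i=2, n_j=1, n_w=1)]                # w_ij O_i^2 I_j
```

With this encoding, `rule_degrees(oja)` gives `(3, 3)`, and the bracketed example rule has apparent degree 1 in the weight but effective degree 3.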
2.4 Terminology 


We do not expect to be able to change how the word Hebbian is used but the recommendation, used in the rest of 
this paper, is to replace Hebbian with the more precise concept of local learning rule, which assumes a pre-existing 
definition of which variables are to be considered local. Within the set of local learning rules, it is easy to see that in 
general linear ($n = 1$) learning rules of the form $\Delta w_{ij} \propto \alpha O_i + \beta O_j + \gamma w_{ij}$ are not very useful (in fact the same 
is true also for rules of the form $\Delta w_{ij} \propto h(O_i) + g(O_j) + k(w_{ij})$ for any functions $h$, $g$, and $k$). Thus, for a local 
learning rule to be interesting it must be at least quadratic. 

Within quadratic learning rules, one could adopt the position that only $\Delta w_{ij} \propto O_i O_j$ should be called Hebbian. 
At the other extreme, one could adopt the position that all quadratic rules with $n = 2$ should be called Hebbian. This 
would include the correlation rule 

$$\Delta w_{ij} \propto (O_i - E(O_i))(O_j - E(O_j)) \tag{12}$$

which requires information about the averages $E(O_i)$ and $E(O_j)$ over the training examples, and other rules of the 
form 

$$\Delta w_{ij} \propto \alpha O_i O_j + \beta O_i + \gamma O_j + \delta \tag{13}$$

Note that this is not the most general possible form since other terms can also be included (i.e. terms in $O_i^2$, $O_j^2$, $w_{ij}$, 
$w_{ij}^2$, $w_{ij} O_i$, and $w_{ij} O_j$) and will be considered below. Note also that under any of these definitions of Hebbian, Oja’s 
rule 

$$\Delta w_{ij} \propto O_i O_j - O_i^2 w_{ij} \tag{14}$$

is local, but not Hebbian since it is not quadratic in the local variables $O_i$, $O_j$, and $w_{ij}$. Rather it is a cubic rule with 
$n = 3$ and $d = 3$. 

In any case, to avoid these terminological complexities, which result from the vagueness of the Hebbian concept, 
we will: (1) focus on the concept of locality; (2) stratify local rules by their degrees $n$ and $d$; and (3) reserve “simple 
Hebb” for the rule $\Delta w_{ij} \propto O_i O_j$, and avoid using “Hebb” in any other context. 

2.5 Time-scales and Averaging 

We assume a training set consisting of $M$ inputs $I(t)$ for $t = 1, \ldots, M$ in the unsupervised case, and $M$ input-target 
pairs $(I(t), T(t))$ in the supervised case. On-line learning with local rules will exhibit stochastic fluctuations and 
the weights will change at each on-line presentation. However, with a small learning rate and randomized order of 
example presentation, we expect the long term behavior to be dominated by the average values of the weight changes 
computed over one epoch. Thus we assume that $O$ varies rapidly over the training data, compared to the synaptic 
weights $w$ which are assumed to remain constant over an epoch. The difference in time-scales is what enables the 
analysis since we assume the weight $w_{ij}$ remains essentially constant throughout an epoch and we can compute the 
average of the changes induced by the training data over an entire epoch. While the instantaneous evolution of the 
weights is governed by the relationship 

$$w_{ij}(t + 1) = w_{ij}(t) + \eta \Delta w_{ij}(t) \tag{15}$$

the assumption of small $\eta$ allows us to average this relation over an entire epoch and write 

$$w_{ij}(k + 1) = w_{ij}(k) + \eta E(\Delta w_{ij}) \tag{16}$$

where the index $k$ is now over entire epochs, and $E$ is the expectation taken over the corresponding epoch. Thus, in 
the analyses, we must first compute the expectation $E$ and then solve the recurrence relation (Equation 16), or the 
corresponding differential equation. 
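
The effect of the time-scale assumption can be seen on a toy example. The sketch below uses an illustrative rule $(T - O) I$ for a linear unit with a single scalar input and a made-up training set, both assumptions of this example. It compares one epoch of on-line updates with a single update that holds the weight fixed and applies the averaged change.

```python
def delta_w(w, x, target):
    """Illustrative local rule for one linear unit with a scalar input: (T - O) * I."""
    return (target - w * x) * x


def online_epoch(w, data, eta):
    """Apply the on-line update once per example, in order."""
    for x, target in data:
        w = w + eta * delta_w(w, x, target)
    return w


def averaged_epoch(w, data, eta):
    """Hold w fixed over the epoch and apply the averaged change once.

    The factor len(data) converts the per-example average E(delta_w) into the
    total change over the epoch; it can equally be absorbed into eta.
    """
    mean_change = sum(delta_w(w, x, t) for x, t in data) / len(data)
    return w + eta * len(data) * mean_change


data = [(1.0, 2.0), (2.0, 4.0), (-1.0, -2.0), (0.5, 1.0)]   # made-up (I, T) pairs
eta = 0.001
w_online = online_epoch(0.0, data, eta)
w_averaged = averaged_epoch(0.0, data, eta)
gap = abs(w_online - w_averaged)   # second order in eta for this rule
```

For this rule the two results differ only by terms of second order in the learning rate, which is why the averaged recurrence is a reasonable description of the on-line dynamics when the learning rate is small.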
2.6 Initial Roadmap 


The article is subdivided into two main parts. In the first part, the focus is on Hebbian learning, or more precisely 
on local learning rules. Because we are restricting ourselves to learning rules with a polynomial form, the initial 
goal is to estimate expectations of the form $E(O_i^{n_i} O_j^{n_j} w_{ij}^{n_{w_{ij}}})$ in the unsupervised case, or $E(T_i^{n_{T_i}} O_i^{n_i} O_j^{n_j} w_{ij}^{n_{w_{ij}}})$ in 
the supervised case. Because of the time-scale assumption, within an epoch we can assume that $w_{ij}$ is constant and 
therefore the corresponding term factors out of the expectation. Thus we are left with estimating terms of the form 
$E(O_i^{n_i} O_j^{n_j})$ in the unsupervised case, or $E(T_i^{n_{T_i}} O_i^{n_i} O_j^{n_j})$ in the supervised case. 

In terms of architectures, we are primarily interested in deep feedforward architectures and thus we focus first on 
layered feedforward networks, with local supervised or unsupervised learning rules, where local learning is applied 
layer by layer in batch mode, starting from the layer closest to the inputs. In this feedforward framework, within 
any single layer of units, all the units learn independently of each other given the inputs provided by the previous 
layer. Thus in essence the entire problem reduces to understanding learning in a single unit and, using the notation of 
Equation 2, to estimating the expectations $E(O^{n_O} I_j^{n_j})$ in the unsupervised case, or $E(T^{n_T} O^{n_O} I_j^{n_j})$ in the supervised 
case, where $I_j$ are the inputs and $O$ is the output of the unit being considered. In what follows, we first consider the 
linear case (Section 3) and then the non-linear case (Section 4). We then give examples of how new learning rules can 
be derived (Section 5). 

In the second part, the focus is on backpropagation. We first study the limitations of purely local learning in 
shallow or deep networks, also called deep local learning (Section 6). Going beyond these limitations naturally leads 
to the introduction of local deep learning algorithms and deep targets algorithms, and the study of the properties of 
the backward learning channel and the optimality of backpropagation (Sections 7 and 8). 

3 Local Learning in the Linear Case 

The study of feedforward layered linear networks is thus reduced to the study of a single linear unit of the form 
$O = \sum_i w_i I_i$. In this case, to understand the behavior of any local learning rule, one must compute expectations 
of the form 

$$E\big(I_i^{n_{I_i}} T^{n_T} O^{n_O} w_i^{n_{w_i}}\big) = w_i^{n_{w_i}} E\Big[I_i^{n_{I_i}} T^{n_T} \Big(\sum_k w_k I_k\Big)^{n_O}\Big] \tag{17}$$

This encompasses also the unsupervised case by letting $n_T = 0$. Thus this expectation is a polynomial in the weights, 
with coefficients that correspond to the statistical moments of the training data of the form $E(I_i^{n_{I_i}} T^{n_T} \prod_k I_k^{m_k})$, 
with $\sum_k m_k = n_O$. When 
this polynomial is linear in the weights ($d \le 1$), the learning equation can be solved exactly using standard methods. 
When the effective degree is greater than 1 ($d > 1$), then the learning equation can be solved in some special cases, 
but not in the general case. 
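
To see concretely how such an expectation becomes a polynomial in the weights with data moments as coefficients, consider the simple Hebb term $O I_i$ for a linear unit. The sketch below, with made-up inputs and weights and illustrative function names, computes its average over the data directly and then again from the second moments of the inputs alone.

```python
def second_moments(inputs):
    """Matrix of second moments E(I_k I_i) estimated from a list of input vectors."""
    m, n = len(inputs), len(inputs[0])
    return [[sum(x[k] * x[i] for x in inputs) / m for i in range(n)] for k in range(n)]


def hebb_expectation_direct(w, inputs):
    """Average of O * I_i over the data, with O = sum_k w_k I_k computed per example."""
    m, n = len(inputs), len(w)
    outputs = [sum(wk * xk for wk, xk in zip(w, x)) for x in inputs]
    return [sum(o * x[i] for o, x in zip(outputs, inputs)) / m for i in range(n)]


def hebb_expectation_from_moments(w, moments):
    """Same quantity written as a linear function of w: sum_k w_k E(I_k I_i)."""
    n = len(w)
    return [sum(w[k] * moments[k][i] for k in range(n)) for i in range(n)]


inputs = [(1.0, 0.0), (0.5, 2.0), (-1.0, 1.0), (2.0, -1.0)]   # made-up training inputs
w = (0.3, -0.7)
direct = hebb_expectation_direct(w, inputs)
via_moments = hebb_expectation_from_moments(w, second_moments(inputs))
```

The two computations agree up to rounding, so once the second moments are known the training examples themselves are no longer needed to evaluate this expectation for any value of the weights.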

To look at this analysis more precisely, here we assume that the learning rule only uses data terms of order two 
or less. Thus only the means, variances, and covariances of $I$ and $T$ are necessary to compute the expectations in the 
learning rule. For example, a term $w_i T O$ is acceptable, but not $w_i T O^2$, which requires third-order moments of the 
data of the form $E(T I_i I_j)$ to compute its expectation. To compute all the necessary expectations systematically, we 
will use the following notations. 


3.1 Notations 

• All vectors are column vectors. 

• $A'$ denotes the transpose of the matrix $A$, and similarly for vectors. 

• $u$ is the $N$-dimensional vector of all ones: $u' = (1, 1, \ldots, 1)$. 

• $\circ$ is the Hadamard or Schur product, i.e. the component-wise product of matrices or vectors of the same size. 
We denote by $v^{(k)}$ the Schur product of $v$ with itself $k$ times, i.e. $v^{(k)} = v \circ v \circ \cdots \circ v$. 

• $\mathrm{diag}\,M$ is an operator that creates a vector whose components are the diagonal entries of the square matrix $M$. 

• When applied to a vector, $\mathrm{Diag}\,u$ represents the square diagonal matrix whose components are the components 
of the vector $u$. 

• $\mathrm{Diag}\,M$ represents the square diagonal matrix whose entries on the diagonal are identical to those of $M$ (and 0 
elsewhere), when $M$ is a square matrix. 

• For the first order moments, we let $E(I_i) = \mu_i$ and $E(T) = \mu_T$. In vector form, $\mu = (E(I_i))$. 

• For the second order moments, we introduce the matrix $\Sigma_{II'} = (E(I_i I_j)) = (\mathrm{Cov}(I_i, I_j) + \mu_i \mu_j)$. 

3.2 Computation of the Expectations 

With these notations, we can compute all the necessary expectations. Thus: 

• In Table 2 we list all the possible terms with $n = 0$ or $n = 1$ and their expectations.

• In Table 3 we list all the possible quadratic terms with $n = 2$ and their expectations.

• In Table 4 we list all the possible cubic terms with $n = 3$, requiring only first and second moments of the data, and their expectations.

• In Table 5 we list all the possible terms of order $n$, requiring only first and second moments of the data, and their expectations.

Note that in Table 5, for the term $w_i^{n-2} I_i^2$ the expectation in matrix form can be written as $w^{(n-2)} \circ \mathrm{diag}\,\Sigma_{II'} = \mathrm{Diag}(\Sigma_{II'})\, w^{(n-2)}$. Thus in the cubic case where $n = 3$, the expectation has the form $\mathrm{Diag}(\Sigma_{II'})\, w$. Likewise, for $w_i^{n-2} I_i T$ the expectation in matrix form can also be written as $w^{(n-2)} \circ \Sigma_{IT'} = (\mathrm{Diag}\,\Sigma_{IT'})\, w^{(n-2)}$. Thus in the cubic case where $n = 3$, the expectation has the form $(\mathrm{Diag}\,\Sigma_{IT'})\, w$.

Note also that when there is a bias term, we consider that the corresponding input $I_0$ is constant and clamped to 1, so that $E(I_0^n) = 1$ for any $n$, and $I_0$ can simply be ignored in any product expression.
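As a small numerical illustration of these notations, the sketch below checks the identity $E(I_i O) = (\Sigma_{II'} w)_i$ for a linear unit on a toy sample. The data values and the weight vector are arbitrary assumptions chosen only for illustration; expectations are taken as averages over the sample.

```python
# Check E(I_i O) = (Sigma_II' w)_i on a small sample (values are illustrative).
data = [[1.0, 2.0], [0.5, -1.0], [-2.0, 0.0], [3.0, 1.0]]   # rows are input vectors I(t)
w = [0.3, -0.7]
M, N = len(data), len(w)

def mean(values):
    return sum(values) / len(values)

# Second-moment matrix Sigma_II' = (E(I_i I_j)).
sigma = [[mean([x[i] * x[j] for x in data]) for j in range(N)] for i in range(N)]

# Left-hand side: direct average of I_i * O with O = sum_k w_k I_k.
outputs = [sum(wk * xk for wk, xk in zip(w, x)) for x in data]
direct = [mean([x[i] * o for x, o in zip(data, outputs)]) for i in range(N)]

# Right-hand side: matrix form Sigma_II' w.
matrix_form = [sum(sigma[i][j] * w[j] for j in range(N)) for i in range(N)]
```

The two lists `direct` and `matrix_form` agree up to floating-point rounding, since for a fixed weight vector the identity is exact.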

3.3 Solving the Learning Recurrence Relation in the Linear Case ($d \le 1$)

When the effective degree $d$ satisfies $d \le 1$, then the recurrence relation provided by Equation 18 is linear for any value of the overall degree $n$. Thus it can be solved by standard methods provided all the necessary data statistics are available to compute the expectations. More precisely, computing the expectation over one epoch leads to the relation

$$w(k+1) = A w(k) + b \tag{18}$$

Starting from $w(0)$ and iterating this relation, the solution can be written as

$$w(k) = A^k w(0) + A^{k-1} b + A^{k-2} b + \ldots + A b + b = A^k w(0) + [I + A + A^2 + \ldots + A^{k-1}]\, b \tag{19}$$

where $I$ denotes the identity matrix. Furthermore, if $A - I$ is an invertible matrix, this expression can be written as

$$w(k) = A^k w(0) + (A^k - I)(A - I)^{-1} b = A^k w(0) + (A - I)^{-1}(A^k - I)\, b \tag{20}$$

When $A$ is symmetric, there is an orthonormal matrix $C$ such that $A = CDC^{-1} = CDC'$, where $D = \mathrm{Diag}(\lambda_1, \ldots, \lambda_N)$ is a diagonal matrix and $\lambda_1, \ldots, \lambda_N$ are the real eigenvalues of $A$. Then for any power $k$ we have $A^k = CD^kC^{-1} = C\,\mathrm{Diag}(\lambda_1^k, \ldots, \lambda_N^k)\,C^{-1}$ and $A^k - I = C(D^k - I)C^{-1} = C\,\mathrm{Diag}(\lambda_1^k - 1, \ldots, \lambda_N^k - 1)\,C^{-1}$, so that Equation 19 becomes

$$w(k) = CD^kC^{-1} w(0) + C[I + D + D^2 + \ldots + D^{k-1}]C^{-1} b = CD^kC^{-1} w(0) + CEC^{-1} b \tag{21}$$

where $E = \mathrm{Diag}(\xi_1, \ldots, \xi_N)$ is a diagonal matrix with $\xi_i = (\lambda_i^k - 1)/(\lambda_i - 1)$ if $\lambda_i \ne 1$, and $\xi_i = k$ if $\lambda_i = 1$. If all the eigenvalues of $A$ are between 0 and 1 ($0 < \lambda_i < 1$ for every $i$) then the vector $w(k)$ converges to the vector $C\,\mathrm{Diag}(1/(1 - \lambda_1), \ldots, 1/(1 - \lambda_N))\,C' b$. If all the eigenvalues of $A$ are 1, then $w(k) = w(0) + kb$.
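The convergence statement can be checked numerically. The following is a minimal sketch, assuming an arbitrary symmetric $2 \times 2$ matrix $A$ with eigenvalues in $(0, 1)$ and an arbitrary vector $b$ (both are illustrative choices, not values from the text). It iterates the linear recurrence and compares the result with the limit $(I - A)^{-1} b$, which is the same vector as $C\,\mathrm{Diag}(1/(1-\lambda_i))\,C'b$.

```python
# Iterate w(k+1) = A w(k) + b and compare with the fixed point (I - A)^{-1} b.
A = [[0.5, 0.2], [0.2, 0.3]]   # symmetric, eigenvalues about 0.62 and 0.18
b = [1.0, 2.0]

def step(w):
    return [sum(A[i][j] * w[j] for j in range(2)) + b[i] for i in range(2)]

def fixed_point():
    # Solve (I - A) w = b for the 2x2 case by Cramer's rule.
    m = [[1 - A[0][0], -A[0][1]], [-A[1][0], 1 - A[1][1]]]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [(b[0] * m[1][1] - m[0][1] * b[1]) / det,
            (m[0][0] * b[1] - m[1][0] * b[0]) / det]

w = [0.0, 0.0]
for _ in range(200):
    w = step(w)
w_star = fixed_point()
```

After 200 iterations the iterate `w` and the fixed point `w_star` coincide to within rounding error, because the contribution of $A^k w(0)$ decays geometrically.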

3.4 Examples 

We now give a few examples of learning equations with $d \le 1$.

| Constant and Linear Terms | Expectation | Matrix Form |
|---|---|---|
| *Unsupervised terms* | | |
| $c_i$ (0,0) | $c_i$ | $c = (c_i)$ |
| $I_i$ (1,0) | $\mu_i$ | $\mu = (\mu_i)$ |
| $O$ (1,1) | $\sum_j w_j \mu_j$ | $(w'\mu)u = (\mu'w)u$ |
| $w_i$ (1,1) | $w_i$ | $w = (w_i)$ |
| *Supervised terms* | | |
| $T$ (1,0) | $\mu_T$ | $\mu_T u$ |

Table 2: Constant and Linear Terms and their Expectations in Scalar and Vector Form. The table contains all the constant and linear terms of degree $(n, d)$ equal to (0,0), (1,0), and (1,1) depending only on first order statistics of the data. The rows are grouped into unsupervised terms (top) and supervised terms (bottom). The terms are sorted by increasing values of the effective degree ($d$), and then by increasing values of the apparent degree of $w_i$.


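| Quadratic Terms | Expectation | Vector Form |
|---|---|---|
| *Unsupervised terms* | | |
| $I_i^2$ (2,0) | $\mathrm{Var}\,I_i + \mu_i^2$ | $\mathrm{diag}(\Sigma_{II'})$ |
| $I_i O$ (2,1) | $w_i(\mathrm{Var}\,I_i + \mu_i^2) + \sum_{j \ne i} w_j(\mathrm{Cov}(I_i, I_j) + \mu_i\mu_j)$ | $(\mathrm{Cov}\,I)w + (\mu\mu')w = \Sigma_{II'} w$ |
| $w_i I_i$ (2,1) | $w_i \mu_i$ | $w \circ \mu = (\mathrm{Diag}\,\mu)w$ |
| $O^2$ (2,2) | $\sum_i w_i^2(\mathrm{Var}\,I_i + \mu_i^2) + \sum_{i<j} 2w_iw_j(\mathrm{Cov}(I_i, I_j) + \mu_i\mu_j)$ | $(w'\Sigma_{II'}w)u$ |
| $w_i O$ (2,2) | $w_i \sum_j w_j \mu_j$ | $(w'\mu)w = (\mu'w)w$ |
| $w_i^2$ (2,2) | $w_i^2$ | $w^{(2)} = (w_i^2) = w \circ w$ |
| *Supervised terms* | | |
| $I_i T$ (2,0) | $\mathrm{Cov}(I_i, T) + \mu_i\mu_T$ | $\mathrm{Cov}(I, T) + \mu_T\mu = \Sigma_{IT'}$ |
| $T^2$ (2,0) | $\mathrm{Var}\,T + \mu_T^2$ | $(\mathrm{Var}\,T + \mu_T^2)u$ |
| $OT$ (2,1) | $\sum_i w_i[\mathrm{Cov}(I_i, T) + \mu_i\mu_T]$ | $(w'\Sigma_{IT'})u$ |
| $w_i T$ (2,1) | $w_i \mu_T$ | $\mu_T w$ |

Table 3: Quadratic Terms and their Expectations in Scalar and Vector Form. The table contains all the quadratic terms $(n, d)$ where $n = 2$ and $d = 0$, 1, or 2. These terms depend only on the first and second order statistics of the data. The rows are grouped into unsupervised terms (top) and supervised terms (bottom). The terms are sorted by increasing values of the effective degree ($d$), and then by increasing values of the apparent degree of $w_i$.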

3.4.1 Unsupervised Simple Hebbian Rule 

As an example, consider the simple Hebb rule with $\Delta w_i = \eta I_i O$ ($n = 2$, $d = 1$). Using Table 3 we get in vector form $E(\Delta w) = \eta \Sigma_{II'} w$ and thus

$$w(k) = (I + \eta\Sigma_{II'})^k\, w(0) \tag{22}$$

In general, this will lead to weights that grow in magnitude exponentially with the number of epochs. For instance, if all the inputs have mean 0 ($\mu = 0$), variance $\sigma_i^2$, and are independent of each other, then

$$w_i(k) = (1 + \eta\sigma_i^2)^k\, w_i(0) \tag{23}$$

Alternatively, we can use the independence approximation to write

$$w_i(k) = w_i(k-1) + \eta E(O(k-1) I_i) \approx w_i(k-1) + \eta\mu_i E(O(k-1)) = w_i(k-1) + \eta\mu_i \sum_j w_j(k-1)\mu_j \tag{24}$$

| Simple Cubic Terms | Expectation | Vector Form |
|---|---|---|
| *Unsupervised terms* | | |
| $w_i I_i^2$ (3,1) | $w_i(\mathrm{Var}\,I_i + \mu_i^2)$ | $w \circ \mathrm{diag}\,\Sigma_{II'} = \mathrm{Diag}(\Sigma_{II'})w$ |
| $w_i I_i O$ (3,2) | $w_i[w_i(\mathrm{Var}\,I_i + \mu_i^2) + \sum_{j \ne i} w_j(\mathrm{Cov}(I_i, I_j) + \mu_i\mu_j)]$ | $w \circ \Sigma_{II'}w$ |
| $w_i^2 I_i$ (3,2) | $w_i^2 \mu_i$ | $w^{(2)} \circ \mu = w \circ (\mathrm{Diag}\,\mu)w$ |
| $w_i O^2$ (3,3) | $w_i[\sum_i w_i^2(\mathrm{Var}\,I_i + \mu_i^2) + \sum_{i<j} 2w_iw_j(\mathrm{Cov}(I_i, I_j) + \mu_i\mu_j)]$ | $(w'\Sigma_{II'}w)w$ |
| $w_i^2 O$ (3,3) | $w_i^2 \sum_j w_j\mu_j$ | $(w'\mu)\,w^{(2)}$ |
| $w_i^3$ (3,3) | $w_i^3$ | $w^{(3)} = (w_i^3) = w \circ w \circ w$ |
| *Supervised terms* | | |
| $w_i I_i T$ (3,1) | $w_i(\mathrm{Cov}(I_i, T) + \mu_i\mu_T)$ | $w \circ \Sigma_{IT'} = (\mathrm{Diag}\,\Sigma_{IT'})w$ |
| $w_i T^2$ (3,1) | $w_i(\mathrm{Var}\,T + \mu_T^2)$ | $(\mathrm{Var}\,T + \mu_T^2)w$ |
| $w_i O T$ (3,2) | $w_i \sum_j w_j[\mathrm{Cov}(I_j, T) + \mu_j\mu_T]$ | $(w'\Sigma_{IT'})\,w$ |
| $w_i^2 T$ (3,2) | $w_i^2 \mu_T$ | $\mu_T w^{(2)} = \mu_T\, w \circ w$ |

Table 4: Cubic Terms and their Expectations in Scalar and Vector Form. The table contains all the terms of degree $(n, d)$ with $n = 3$ and $d = 0$, 1, 2 or 3 that depend only on the first and second order statistics of the data. The rows are grouped into unsupervised terms (top) and supervised terms (bottom). The terms are sorted by increasing values of the effective degree ($d$), and then by increasing values of the apparent degree of $w_i$.


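| Simple $n$-th Terms | Expectation | Vector Form |
|---|---|---|
| *Unsupervised terms* | | |
| $w_i^{n-2} I_i^2$ $(n, n-2)$ | $w_i^{n-2}(\mathrm{Var}\,I_i + \mu_i^2)$ | $w^{(n-2)} \circ \mathrm{diag}\,\Sigma_{II'} = \mathrm{Diag}(\Sigma_{II'})\,w^{(n-2)}$ |
| $w_i^{n-2} I_i O$ $(n, n-1)$ | $w_i^{n-2}[w_i(\mathrm{Var}\,I_i + \mu_i^2) + \sum_{j \ne i} w_j(\mathrm{Cov}(I_i, I_j) + \mu_i\mu_j)]$ | $w^{(n-2)} \circ \Sigma_{II'}w$ |
| $w_i^{n-1} I_i$ $(n, n-1)$ | $w_i^{n-1}\mu_i$ | $w^{(n-1)} \circ \mu = w^{(n-2)} \circ (\mathrm{Diag}\,\mu)w$ |
| $w_i^{n-2} O^2$ $(n, n)$ | $w_i^{n-2}[\sum_i w_i^2(\mathrm{Var}\,I_i + \mu_i^2) + \sum_{i<j} 2w_iw_j(\mathrm{Cov}(I_i, I_j) + \mu_i\mu_j)]$ | $(w'\Sigma_{II'}w)\,w^{(n-2)}$ |
| $w_i^{n-1} O$ $(n, n)$ | $w_i^{n-1}\sum_j w_j\mu_j$ | $(w'\mu)\,w^{(n-1)}$ |
| $w_i^n$ $(n, n)$ | $w_i^n$ | $w^{(n)} = (w_i^n) = w \circ \ldots \circ w$ |
| *Supervised terms* | | |
| $w_i^{n-2} I_i T$ $(n, n-2)$ | $w_i^{n-2}(\mathrm{Cov}(I_i, T) + \mu_i\mu_T)$ | $w^{(n-2)} \circ \Sigma_{IT'} = (\mathrm{Diag}\,\Sigma_{IT'})\,w^{(n-2)}$ |
| $w_i^{n-2} T^2$ $(n, n-2)$ | $w_i^{n-2}(\mathrm{Var}\,T + \mu_T^2)$ | $(\mathrm{Var}\,T + \mu_T^2)\,w^{(n-2)}$ |
| $w_i^{n-2} O T$ $(n, n-1)$ | $w_i^{n-1}E(I_i T) + w_i^{n-2}\sum_{j \ne i} w_j E(I_j T)$ | $(w'\Sigma_{IT'})\,w^{(n-2)}$ |
| $w_i^{n-1} T$ $(n, n-1)$ | $w_i^{n-1}\mu_T$ | $\mu_T w^{(n-1)} = \mu_T\, w \circ \ldots \circ w$ |

Table 5: Simple Terms of Order $n$ and their Expectations in Scalar and Vector Form. The table contains all the terms of degree $(n, d)$ with $d = n - 2$, $n - 1$, or $n$ that depend only on the first and second order statistics of the data. The rows are grouped into unsupervised terms (top) and supervised terms (bottom). The terms are sorted by increasing values of the effective degree ($d$), and then by increasing values of the apparent degree.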

which, in vector form, gives the approximation

$$w(k) = w(k-1) + \eta\,\mu' w(k-1)\,\mu = (I + \eta A)\,w(k-1) \quad \text{or} \quad w(k) = (I + \eta A)^k\, w(0) \tag{25}$$

where $A = \mu\mu'$.

3.4.2 Supervised Simple Hebbian Rule (Clamped Case)

As a second example, consider the supervised version of the simple Hebb rule where the output is clamped to some target value $T$ with $\Delta w_i = \eta I_i T$ ($n = 2$, $d = 0$). Using Table 3 we get in vector form $E(\Delta w) = \eta\Sigma_{IT'}$ and thus

$$w(k) = w(0) + \eta k \Sigma_{IT'} \tag{26}$$

In general the weights will grow in magnitude linearly with the number $k$ of epochs, unless $E(I_i T) = 0$ in which case the corresponding weight remains constant, $w_i(k) = w_i(0)$.

Note that in some cases it is possible to assume, as a quick approximation, that the targets are independent of the inputs so that $E(\Delta w_i) = \eta E(T)E(I_i) = \eta E(T)\mu_i$. This simple approximation gives

$$w(k) = w(0) + \eta k E(T)\mu \tag{27}$$

Thus the weights are growing linearly in the direction of the center of gravity of the input data.

Thus, in the linear setting, many local learning rules lead to divergent weights. There are notable exceptions, 
however, in particular when the learning rule is performing some form of (stochastic) gradient descent on a convex 
objective function. 

3.4.3 Simple Anti-Hebbian Rule 

The anti-Hebbian quadratic learning rule $\Delta w_i = -\eta I_i O$ ($n = 2$, $d = 1$) performs gradient descent on the objective function $\frac{1}{2}\sum_t O^2(t)$ and will tend to converge to the uninteresting solution where all weights (bias included) are equal to zero.


3.4.4 Gradient Descent Rule 

A more interesting example is provided by the rule $\Delta w_i = \eta(T - O)I_i$ ($n = 2$, $d = 1$). Using Table 3 we get in vector form $E(\Delta w) = \eta(\Sigma_{IT'} - \Sigma_{II'}w)$. The rule is convergent (with properly decreasing learning rate $\eta$) because it performs gradient descent on the quadratic error function $\frac{1}{2}\sum_t (T(t) - O(t))^2$, converging in general to the linear regression solution.
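To make the convergence claim concrete, here is a minimal sketch that iterates the expected update $E(\Delta w) = \eta(\Sigma_{IT'} - \Sigma_{II'}w)$ on a toy data set. The data, the fixed learning rate, and the number of iterations are illustrative assumptions; the targets are chosen to be exactly linear in the input so that the regression solution is known in advance.

```python
# Expected update E(dw) = eta * (Sigma_IT' - Sigma_II' w) on a toy data set.
xs = [0.0, 1.0, 2.0, 3.0]
inputs = [[1.0, x] for x in xs]          # first component is the bias input
targets = [1.0 + 2.0 * x for x in xs]    # exactly linear targets (assumption)
M = len(inputs)

sigma_ii = [[sum(I[i] * I[j] for I in inputs) / M for j in range(2)] for i in range(2)]
sigma_it = [sum(I[i] * t for I, t in zip(inputs, targets)) / M for i in range(2)]

eta = 0.1
w = [0.0, 0.0]
for _ in range(2000):
    grad = [sigma_it[i] - sum(sigma_ii[i][j] * w[j] for j in range(2)) for i in range(2)]
    w = [w[i] + eta * grad[i] for i in range(2)]
```

With these choices the weights approach the regression solution, a bias of 1 and a slope of 2. A fixed learning rate suffices here because the averaged dynamics are deterministic; the decreasing schedule mentioned in the text matters for the stochastic, per-example version.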

In summary, when $d \le 1$ the dynamics of the learning rule can be solved exactly in the linear case and it is entirely determined by the statistical moments of the data, in particular by the means, variances, and covariances of the inputs and targets (e.g. when $n \le 2$).


3.5 The Case $d \ge 2$

When the effective degree of the weights is greater than one in the learning rule, the recurrence relation is not linear and there is no systematic solution in the general case. It must be noted however that in some special cases, this can result in a Bernoulli or Riccati ($d = 2$) differential equation for the evolution of each weight which can be solved (e.g. [47]). For reasons that will become clear in later sections, let us for instance consider the learning equation

$$\Delta w_i = \eta(1 - w_i^2) I_i \tag{28}$$

with $n = 3$ and $d = 2$. We have

$$E(\Delta w_i) = \eta(1 - w_i^2)\mu_i \tag{29}$$

Dropping the index $i$, the corresponding Riccati differential equation is given by

$$\frac{dw}{dt} = \eta\mu - \eta\mu w^2 \tag{30}$$

The intuitive behavior of this equation is clear. Suppose we start at, or near, $w = 0$. Then the sign of the derivative at the origin is determined by the sign of $\mu$, and $w$ will either increase and asymptotically converge towards +1 when $\mu > 0$, or decrease and asymptotically converge towards -1 when $\mu < 0$. Note also that $w_{obv1}(t) = 1$ and $w_{obv2}(t) = -1$ are two obvious constant solutions of the differential equation.

To solve the Riccati equation more formally we use the known obvious solutions and introduce the new variable $z = 1/(w - w_{obv}) = 1/(w + 1)$ (and similarly one can introduce the new variable $z = 1/(w - 1)$ to get a different solution). As a result, $w = (1 - z)/z$. It is then easy to see that the new variable $z$ satisfies a linear differential equation. More precisely, a simple calculation gives

$$\frac{dz}{dt} = -2\eta\mu z + \eta\mu \tag{31}$$

resulting in

$$z(t) = Ce^{-2\eta\mu t} + \frac{1}{2} \tag{32}$$

and thus

$$w(t) = \frac{1 - 2Ce^{-2\eta\mu t}}{1 + 2Ce^{-2\eta\mu t}} \quad \text{with} \quad w(0) = \frac{1 - 2C}{1 + 2C} \quad \text{or} \quad C = \frac{1 - w(0)}{2(1 + w(0))} \tag{33}$$

Simulations are shown in Figure 5.2 for the unsupervised case, and Figure 5.3 for the supervised case.
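The closed-form solution can be sanity-checked against a direct numerical integration of the Riccati equation. The sketch below is only illustrative: the values of $\eta$, $\mu$, $w(0)$, and the Euler step size are assumptions, and the function names are hypothetical helpers rather than anything defined in the text.

```python
import math

def riccati_closed_form(t, w0, eta, mu):
    # Closed-form solution with C = (1 - w(0)) / (2 (1 + w(0))).
    c = (1.0 - w0) / (2.0 * (1.0 + w0))
    e = math.exp(-2.0 * eta * mu * t)
    return (1.0 - 2.0 * c * e) / (1.0 + 2.0 * c * e)

def riccati_euler(t_end, w0, eta, mu, dt=1e-3):
    # Forward Euler integration of dw/dt = eta * mu * (1 - w^2).
    w = w0
    for _ in range(int(round(t_end / dt))):
        w += dt * eta * mu * (1.0 - w * w)
    return w

eta, mu, w0, t_end = 0.1, 0.5, 0.0, 50.0
exact = riccati_closed_form(t_end, w0, eta, mu)
numeric = riccati_euler(t_end, w0, eta, mu)
```

For $w(0) = 0$ the constant is $C = 1/2$ and the closed form reduces to $w(t) = \tanh(\eta\mu t)$, which makes the saturation towards +1 for $\mu > 0$ explicit.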


3.5.1 Oja’s Rule 


An important example of a rule with $d > 1$ is provided by Oja's rule [44]

$$\Delta w_i = \eta(O I_i - O^2 w_i) \tag{34}$$

with $d = 3$, originally derived for a linear neuron. The idea behind this rule is to control the growth of the weights induced by the simple Hebb rule by adding a decay term. The form of the decay term can easily be obtained by requiring the weights to have constant norm and expanding the corresponding constraint in Taylor series with respect to $\eta$. It can be shown under reasonable assumptions that the weight vector will converge towards the principal eigenvector of the input correlation matrix. Converging learning rules are discussed more broadly in Section 5.
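The convergence to the principal eigenvector can be illustrated with the averaged dynamics of Oja's rule for a linear unit, $E(\Delta w) = \eta(\Sigma_{II'}w - (w'\Sigma_{II'}w)w)$, which follows from $E(OI) = \Sigma_{II'}w$ and $E(O^2) = w'\Sigma_{II'}w$ for a fixed weight vector. This is a sketch under stated assumptions: the $2 \times 2$ correlation matrix, the learning rate, and the initial weights are arbitrary illustrative values, and the stochastic per-example rule is replaced by its expectation.

```python
import math

# Averaged Oja dynamics for a linear unit: E(dw) = eta * (Sigma w - (w' Sigma w) w).
sigma = [[3.0, 1.0], [1.0, 2.0]]   # illustrative input correlation matrix
eta = 0.05
w = [0.3, -0.1]

def matvec(m, v):
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

for _ in range(2000):
    sw = matvec(sigma, w)
    quad = sum(wi * si for wi, si in zip(w, sw))      # w' Sigma w = E(O^2)
    w = [wi + eta * (si - quad * wi) for wi, si in zip(w, sw)]

norm = math.sqrt(sum(wi * wi for wi in w))
rayleigh = sum(wi * si for wi, si in zip(w, matvec(sigma, w)))
top_eigenvalue = (5.0 + math.sqrt(5.0)) / 2.0          # largest eigenvalue of sigma
```

At convergence the weight vector has unit norm and its Rayleigh quotient `rayleigh` matches the largest eigenvalue of the matrix, which is the signature of the principal eigenvector.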

4 Local Learning in the Non-Linear Case 

To extend the theory to the non-linear case, we consider a non-linear unit $O = f(S) = f(\sum_i w_i I_i)$ where $f$ is a transfer function that is logistic (0,1) or hyperbolic tangent (-1,1) in the differentiable case, or the corresponding threshold functions. All the expectations computed in the linear case that do not involve the variable $O$ can be computed exactly as in the linear case. Furthermore, at least in the case of threshold gates, we can easily deal with powers of $O$ because $O^2 = O$ in the (0,1) case, and $O^2 = 1$ in the (-1,1) case. Thus, in essence, the main challenge is to compute terms of the form $E(O)$ and $E(OI_i)$ when $O$ is non-linear. We next show how these expectations can be approximated.

4.1 Terms in $E(O)$ and the Dropout Approximation

When the transfer function is a sigmoidal logistic function $\sigma = \sigma_{[0,1]}$, we can use the approximation [9]

$$E(O) = E(\sigma(S)) \approx \sigma(E(S)) \quad \text{with} \quad |E(\sigma(S)) - \sigma(E(S))| \le 2E(1 - E)|1 - 2E| \tag{35}$$

where $E = E(O) = E(\sigma(S))$ and $V = \mathrm{Var}(O) = \mathrm{Var}(\sigma(S))$. Thus $E(O) \approx \sigma(\sum_i w_i\mu_i)$. During learning, as the $w_i$ vary, this term could fluctuate. Note however that if the data is centered ($\mu_i = 0$ for every $i$), which is often done in practice, then we can approximate the term $E(O)$ by a constant equal to $\sigma(0)$ across all epochs. Although there are cases where the approximation of Equation 35 is not precise, in most reasonable cases it is quite good. This approximation has its origin in the dropout approximation $E(O) \approx NWGM(O) = \sigma(E(S))$ where NWGM represents the normalized weighted geometric mean. These and several other related results are proven in [9].

When the transfer function is a hyperbolic tangent function we can use the same approximation

$$E(O) = E(\tanh(S)) \approx \tanh(E(S)) \tag{36}$$

This is simply because

$$\tanh(S) = 2\sigma(2S) - 1 \tag{37}$$

Equation 35 is valid not only for the standard logistic function, but also for any logistic function with slope $\lambda$ of the form $\sigma(S) = 1/(1 + ce^{-\lambda S})$. Threshold functions are approximated by sigmoidal functions with $\lambda \to +\infty$. Thus the approximation can be used also for threshold functions with $\lambda \to +\infty$, with similar caveats. More generally, if the transfer function $f$ is differentiable and can be expanded as a Taylor series around the mean $E(S)$, we always have $f(S) \approx f(E(S)) + f'(E(S))(S - E(S)) + \frac{1}{2}f''(E(S))(S - E(S))^2$ and thus $E(f(S)) \approx f(E(S)) + \frac{1}{2}f''(E(S))\,\mathrm{Var}\,S$. Thus if $\mathrm{Var}\,S$ is small or $f''(E(S))$ is small, then $E(f(S)) \approx f(E(S)) = f(\sum_i w_i\mu_i)$. The approximations can often be used also for other functions (e.g. rectified linear), as discussed in [9].
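A quick Monte Carlo experiment gives a feel for the quality of the approximation $E(\sigma(S)) \approx \sigma(E(S))$. The sketch below assumes, purely for illustration, that the activation $S$ is Gaussian with mean 1 and standard deviation 0.5; the distribution, the sample size, and the seed are not taken from the text.

```python
import math
import random

def logistic(s):
    return 1.0 / (1.0 + math.exp(-s))

rng = random.Random(0)
# Illustrative assumption: the activation S is Gaussian with mean 1 and std 0.5.
samples = [rng.gauss(1.0, 0.5) for _ in range(20000)]

mean_s = sum(samples) / len(samples)
mean_out = sum(logistic(s) for s in samples) / len(samples)   # E = E(sigma(S))

gap = abs(mean_out - logistic(mean_s))                         # |E(sigma(S)) - sigma(E(S))|
bound = 2.0 * mean_out * (1.0 - mean_out) * abs(1.0 - 2.0 * mean_out)
```

In this setting the observed `gap` is small and sits well below `bound`, consistent with the second-order Taylor argument: the correction term is $\frac{1}{2}\sigma''(E(S))\,\mathrm{Var}\,S$, which is small when the variance of $S$ is small.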

4.2 Terms in E{OIi) 

Next, in the analysis of learning rules in the non-linear case, we must deal with expectations of the form $E(OI_i)$. A first simple approximation is to assume that $O$ and $I_i$ are almost independent and therefore

$$E(OI_i) \approx E(O)E(I_i) = E(O)\mu_i \tag{38}$$

In this expression, $E(O)$ can in turn be approximated using the method above. For instance, in the case of a logistic or tanh transfer function

$$E(OI_i) \approx E(O)E(I_i) = E(O)\mu_i \approx \mu_i\,\sigma(E(S)) = \mu_i\,\sigma\Big(\sum_{i=1}^{N} w_i\mu_i\Big) \tag{39}$$

If the data is centered, the approximation reduces to 0.

A second possible approximation is obtained by expanding the sigmoidal transfer function into a Taylor series. To a first order, this gives

$$E(OI_i) = E\Big[\sigma\Big(\sum_j w_jI_j\Big)I_i\Big] = E\Big[\sigma\Big(\sum_{j \ne i} w_jI_j + w_iI_i\Big)I_i\Big] \approx E\Big[\sigma\Big(\sum_{j \ne i} w_jI_j\Big)I_i + \sigma'\Big(\sum_{j \ne i} w_jI_j\Big)w_iI_iI_i\Big] \tag{40}$$

with the approximation quality of a first-order Taylor approximation to $\sigma$. To further estimate this term we need to assume that the terms depending on $j$ but not on $i$ are independent of the terms dependent on $i$, i.e. that the data covariances are 0. In this case,

$$E(OI_i) \approx E\Big(\sigma\Big(\sum_{j \ne i} w_jI_j\Big)\Big)E(I_i) + E\Big(\sigma'\Big(\sum_{j \ne i} w_jI_j\Big)\Big)w_iE(I_i^2) \approx \mu_i\,\sigma\Big(\sum_{j \ne i} w_j\mu_j\Big) + E\Big(\sigma'\Big(\sum_{j \ne i} w_jI_j\Big)\Big)w_i(\mu_i^2 + \sigma_i^2) \tag{41}$$

where $\sigma_i^2 = \mathrm{Var}\,I_i$ and the latter approximation uses again the dropout approximation. If in addition the data is centered ($\mu_i = 0$ for every $i$) we have

$$E(OI_i) \approx E\Big(\sigma'\Big(\sum_{j \ne i} w_jI_j\Big)\Big)w_i\sigma_i^2 \tag{42}$$

which reduces back to a linear term in $w$

$$E(OI_i) \approx \sigma'(0)\,w_i\sigma_i^2 \tag{43}$$

when the weights are small, the typical case at the beginning of learning.

In summary, when $n \le 2$ and $d \le 1$ the dynamics of the learning rule can be solved exactly or approximately, even in the non-linear case, and it is entirely determined by the statistical moments of the data.

4.3 Examples 

In this section we consider simple local learning rules applied to a single sigmoidal or threshold unit. 

4.3.1 Unsupervised Simple Hebb Rule 

We first consider the simple Hebb rule $\Delta w_i = \eta I_i O$. Using the approximations described above we obtain

$$E(\Delta w_i) \approx \eta\mu_i E(O) \approx \eta\mu_i\,\sigma\Big(\sum_j w_j\mu_j\Big) \quad \text{thus} \quad w_i(k) = w_i(0) + \eta\mu_i \sum_{l=0}^{k-1} \sigma\Big(\sum_j w_j(l)\mu_j\Big) \tag{44}$$

Thus the weight vector tends to align itself with the center of gravity of the data. However, this provides only a direction for the weight vector which continues to grow to infinity along that direction, as demonstrated in Figure 4.1.



Figure 4.1: Single unit trained on the MNIST data set (60,000 examples) for 500 epochs, with a learning rate of 0.001 
using the simple Hebb rule in unsupervised fashion. The fan-in is 784 (28 x 28). The weights are initialized from 
a normal distribution with standard deviation 0.01. Left: Angle of the weight vector to the center of gravity. Right: 
Norm of the weight vector. 


4.3.2 Supervised Simple Hebb Rule 

Here we consider a single [-1,1] sigmoidal or threshold unit trained using a set of $M$ training input-output pairs $I(t), T(t)$ where $T(t) = \pm 1$. In supervised mode, the output is clamped to the target value (note that in general this is different from the perceptron or backpropagation rule).

Here we apply the simple Hebb rule with the output clamped to the target so that $\Delta w_i = \eta I_i T$. Thus the expectation $E(\Delta w_i) = \eta E(I_i T)$ is constant across all the epochs and depends only on the data moments. In general the weights will grow linearly with the number of epochs, unless $E(\Delta w_i) = 0$ in which case the $w_i$ will remain constant and equal to the initial value $w_i(0)$. In short,

$$w_i(k) = w_i(0) + k\eta E(I_i T) \tag{45}$$

If the targets are essentially independent of the inputs, we have $E(\Delta w_i) \approx \eta E(I_i)E(T) = \eta\mu_i E(T)$, and thus after $k$ learning epochs the weights are given by

$$w_i(k) = w_i(0) + k\eta\mu_i\mu_T \tag{46}$$

In this case we see again that the weight vector tends to be co-linear with the center of gravity of the data, with a sign that depends on the average target, and a norm that grows linearly with the number of epochs.
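The linear growth of the weights under the clamped rule is easy to reproduce. The sketch below applies the epoch-averaged update $\Delta w_i = \eta E(I_i T)$ to a tiny $\pm 1$ data set and compares the result with the closed form $w_i(k) = w_i(0) + k\eta E(I_i T)$. The data set, learning rate, and initial weights are illustrative assumptions.

```python
# Clamped simple Hebb rule, averaged over one epoch: dw_i = eta * E(I_i T).
inputs = [[1.0, -1.0, 1.0], [1.0, 1.0, -1.0], [-1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
targets = [1.0, -1.0, 1.0, 1.0]
eta, epochs = 0.1, 25
M, N = len(inputs), len(inputs[0])

# Data moments E(I_i T), constant across epochs.
moment_it = [sum(I[i] * t for I, t in zip(inputs, targets)) / M for i in range(N)]

w0 = [0.05, -0.02, 0.01]
w = list(w0)
for _ in range(epochs):
    w = [w[i] + eta * moment_it[i] for i in range(N)]

# Closed form: w_i(k) = w_i(0) + k * eta * E(I_i T).
predicted = [w0[i] + epochs * eta * moment_it[i] for i in range(N)]
```

In this toy example the first two moments $E(I_i T)$ vanish, so the corresponding weights stay at their initial values, while the third weight grows linearly with the number of epochs.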

4.3.3 Gradient Descent Rule 

A last example of a convergent rule is provided by $\Delta w_i = \eta(T - O)I_i$ with the logistic transfer function. The rule is convergent (with properly decreasing learning rate $\eta$) because it performs gradient descent on the relative entropy error function $\mathcal{E}_{err}(w) = -\sum_t \left[T(t)\log O(t) + (1 - T(t))\log(1 - O(t))\right]$. Remarkably, up to a trivial scaling factor of two that can be absorbed into the learning rate, this learning rule has exactly the same form when the tanh function is used over the [-1,1] range (Appendix B).
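The claim that $(T - O)I_i$ is the descent direction of the relative entropy error for a logistic unit can be verified numerically with finite differences. The following sketch uses a single training example; the weights, input, and target are arbitrary illustrative values and the function names are hypothetical helpers.

```python
import math

def logistic(s):
    return 1.0 / (1.0 + math.exp(-s))

def relative_entropy_error(w, I, T):
    # Single-example error with a logistic output and a target T in [0, 1].
    O = logistic(sum(wi * xi for wi, xi in zip(w, I)))
    return -(T * math.log(O) + (1.0 - T) * math.log(1.0 - O))

w = [0.2, -0.4, 0.1]
I = [1.0, 0.5, -2.0]
T = 1.0

O = logistic(sum(wi * xi for wi, xi in zip(w, I)))
local_rule = [(T - O) * xi for xi in I]            # (T - O) I_i

# Central finite differences of the error, negated to give the descent direction.
h = 1e-6
numeric = []
for i in range(len(w)):
    up = [wj + (h if j == i else 0.0) for j, wj in enumerate(w)]
    down = [wj - (h if j == i else 0.0) for j, wj in enumerate(w)]
    numeric.append(-(relative_entropy_error(up, I, T) - relative_entropy_error(down, I, T)) / (2 * h))
```

The lists `local_rule` and `numeric` agree to within the finite-difference error, which reflects the cancellation of the factor $O(1 - O)$ between the derivative of the logistic function and the derivative of the error.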


5 Derivation of New Learning Rules 


The local learning framework is also helpful for discovering new learning rules. In principle, one could recursively 
enumerate all polynomial learning rules with rational coefficients and search for rules satisfying particular properties. 
However this is not necessary for several reasons. In practice, we are only interested in polynomial learning rules 
with relatively small degree (e.g. n < 5) and more direct approaches are possible. To provide an example, here we 
consider the issue of convergence and derive new convergent learning rules. 

We first note that a major concern with a Hebbian rule, even in the simple case $\Delta w_{ij} \propto O_iO_j$, is that the weights tend to diverge over time towards very large positive or very large negative values. To ensure that the weights remain within a finite range, it is natural to introduce a decay term so that $\Delta w_{ij} \propto O_iO_j - Cw_{ij}$ with $C > 0$. The decay coefficient can also be adaptive as long as it remains positive. This is exactly what happens in Oja's cubic learning rule [44]

$$\Delta w_{ij} \propto O_iO_j - O_i^2 w_{ij} \tag{47}$$

which has a weight decay term $O_i^2 w_{ij}$ proportional to the square of the output and is known to extract the principal component of the data. Using different adaptive terms, we immediately get new rules such as:

$$\Delta w_{ij} \propto O_iO_j - O_j^2 w_{ij} \tag{48}$$

and

$$\Delta w_{ij} \propto O_iO_j - (O_iO_j)^2 w_{ij} = O_iO_j(1 - O_iO_jw_{ij}) \tag{49}$$

And when the postsynaptic neuron has a target $T_i$, we can consider the clamped or gradient descent version of these rules. In the clamped cases, some or all the occurrences of $O_i$ in Equations 47, 48, and 49 are to be replaced by the target $T_i$. In the gradient descent version, some or all the occurrences of $O_i$ in Equations 47, 48, and 49 are to be replaced by $(T_i - O_i)$. The corresponding list of rules is given in Appendix C.

To derive additional convergent learning rules, we can take yet a different approach by introducing a saturation 
effect on the weights. To ensure that the weights remain in the [—1,1] range, we can assume that the weights are 
calculated by applying a hyperbolic tangent function. 

Thus consider a [-1,1] system trained using the simple Hebb rule $\Delta w_{ij} \propto O_iO_j$. To keep the weights in the [-1,1] range throughout learning, we can write:

$$w_{ij}(t+1) = \tanh\left[w_{ij}(0) + \eta O_i(1)O_j(1) + \ldots + \eta O_i(t)O_j(t) + \eta O_i(t+1)O_j(t+1)\right] \tag{50}$$

where $\eta$ is the learning rate. By taking a first order Taylor expansion and using the fact that $\tanh(x)' = 1 - \tanh^2(x)$, we obtain the new rule

$$w_{ij}(t+1) = w_{ij}(t) + \eta(1 - w_{ij}^2)O_i(t)O_j(t) \quad \text{or} \quad \Delta w_{ij} \propto (1 - w_{ij}^2)O_iO_j \tag{51}$$

Note that while simple, this is a quartic learning rule in the local variables with $n = 4$ and $d = 3$. The rule forces $\Delta w_{ij} \to 0$ as $|w_{ij}| \to 1$. In the supervised case, this rule becomes

$$\Delta w_{ij} \propto (1 - w_{ij}^2)T_iO_j \tag{52}$$

in the clamped setting, and

$$\Delta w_{ij} \propto (1 - w_{ij}^2)(T_i - O_i)O_j \tag{53}$$

in the gradient descent setting.

To further analyze the behavior of this rule in the clamped setting, for instance, let us consider a single tanh or threshold [-1,1] unit with $\Delta w_i = \eta(1 - w_i^2)TI_i$. In the regime where the independence approximation is acceptable, this yields $E(\Delta w_i) = \eta(1 - w_i^2)E(T)\mu_i$ which is associated with the Riccati differential equation that we already solved in the linear case. One of the solutions (converging to +1) is given by

$$w(k) = \frac{1 - 2Ce^{-2\mu\eta E(T)k}}{1 + 2Ce^{-2\mu\eta E(T)k}} \quad \text{with} \quad C = \frac{1 - w(0)}{2(1 + w(0))} \tag{54}$$

Simulations of these new rules demonstrating how they effectively control the magnitude of the weights and how well the theory fits the empirical data are shown in Figures 5.1, 5.2, 5.3, and 5.4.
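The saturation effect of the new rule can also be seen in a few lines of simulation. The sketch below compares the clamped simple Hebb rule with the clamped version of the new rule, $\Delta w_i = \eta(1 - w_i^2)TI_i$, on synthetic data. The data set (random $\pm 1$ inputs whose target is the first input), the learning rate, and the number of epochs are assumptions made only for this illustration.

```python
import math
import random

rng = random.Random(0)
N, M, eta, epochs = 10, 50, 0.01, 200
# Illustrative data: random +/-1 inputs with targets given by the first input.
inputs = [[rng.choice([-1.0, 1.0]) for _ in range(N)] for _ in range(M)]
targets = [x[0] for x in inputs]

w_hebb = [0.0] * N
w_new = [0.0] * N
max_abs_new = 0.0
for _ in range(epochs):
    for x, t in zip(inputs, targets):
        for i in range(N):
            w_hebb[i] += eta * t * x[i]                            # clamped simple Hebb
            w_new[i] += eta * (1.0 - w_new[i] ** 2) * t * x[i]     # clamped new rule
            max_abs_new = max(max_abs_new, abs(w_new[i]))

norm_hebb = math.sqrt(sum(v * v for v in w_hebb))
norm_new = math.sqrt(sum(v * v for v in w_new))
```

With $|TI_i| = 1$ and $\eta < 1/2$, the map $w \mapsto w \pm \eta(1 - w^2)$ sends the interval $[-1, 1]$ into itself, so every weight of the new rule stays in $[-1, 1]$ and the norm is at most $\sqrt{N}$, whereas the norm under the simple Hebb rule keeps growing with the number of epochs.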

Finally, another alternative mechanism for preventing unlimited growth of the weights is to reduce the learning 
rate as learning progresses, for instance using a linear decay schedule. 


6 What is Learnable by Shallow or Deep Local Learning 

The previous sections have focused on the study of local learning rules, stratified by their degree, in shallow networks. 
In this section, we begin to look at local learning rules applied to deep feedforward networks and partially address the 
question of what is locally learnable in feedforward networks. 

Specifically we want to consider shallow (single adaptive layer) local learning, or deep local learning defined as 
learning in deep feedforward layered architectures with learning rules of the form $\Delta w_{ij} = F(O_i, O_j, w_{ij})$ applied successively to all the layers, starting from the input layer, possibly followed by a supervised learning rule of the form $\Delta w_{ij} = F(O_i, O_j, T_i, w_{ij})$ in the top layer alone, when targets are available for the top layer (Figure [63b). One
would like to understand what input-output functions can be learnt from examples using this strategy and whether this 
provides a viable alternative to back-propagation. We begin with a few simulation experiments to further motivate the 
analyses. 

6.1 Simulation Experiments: Learning Boolean Functions 

We conduct experiments using various local learning rules to try to learn Boolean functions with small fan-in architectures with one or two adaptive layers. These experiments are purposely carried out to show how simulations run in simple cases can raise false hopes of learnability by local rules that do not extend to large fan-in and more complex functions, as shown in a later section. Specifically, we train binary [-1,1] threshold gates to learn Boolean functions of up to 4 inputs, using the simple Hebb rule, the Oja rule, and the new rule corresponding to Equation 51 and its supervised version. Sometimes multiple random initializations of the weights are tried and the function is considered to be learnable if it is learnt in at least one case. [Note: The number of trials needed is not important since the ultimate goal is to show that this local learning strategy cannot work for more complex functions, even when a very large number of trials is used.] In Tables 6 and 7 we report the results obtained both in the shallow case (single adaptive layer) trained in a supervised manner, and the results obtained in the deep case (two adaptive layers) where the adaptive input layer is trained in an unsupervised manner and the adaptive output layer is trained in a supervised manner. In the experiments, all inputs and targets are binary (-1,1), all the units have a bias, and the learning rate decays linearly.

Figure 5.1: Temporal evolution of the norm of the weight vector of a single threshold gate with 20 inputs and a bias trained in supervised mode using 500 randomly generated training examples using three different learning rules: Basic Hebb, Oja, and the New Rule. Oja and the New Rule gracefully prevent the unbounded growth of the weights. The New Rule produces a weight vector whose components are fairly saturated (close to -1 or 1) with a total norm close to $\sqrt{21}$.

Figure 5.2: A learning rule that results in a Riccati differential equation. The solution to this Riccati equation tells us that all the weights will converge to 1. A typical weight is shown. It is initialized randomly from $N(0, 0.1)$ and trained on 1000 MNIST examples, resulting in a fan-in of 784 (28 x 28). There is almost perfect agreement between the theoretical and empirical curve.

Figure 5.3: When the independence assumption is reasonable, the Riccati equation describes the dynamics of learning and can be used to find the exact solution. The typical weight shown here is randomly initialized from $N(0, 0.1)$ and is trained on $M = 1000$ MNIST samples to recognize digits 0-8 vs 9 classes. $N = 784$.

Figure 5.4: A single neuron with tanh activation trained to recognize the handwritten digit nine with five supervised learning rules. The input data is 100 MNIST images (made binary by setting pixels to +1 if the greyscale value surpassed a threshold of 0.2, and -1 otherwise), and binary -1, +1 targets. Weights were initialized independently from $N(0, 0.1)$, and updated with learning rate $\eta = 0.1$.


As shown in Table 6, 14 of the 16 possible Boolean functions of two variables ($N = 2$) can be learnt using the Simple Hebb, Oja, and new rules. The two Boolean functions that cannot be learnt are of course XOR and its negation (XNOR), which cannot be implemented by a single layer network. Using deep local learning in two-layer networks, all three rules are able to learn all the Boolean functions with $N = 2$, demonstrating that at least some complex functions can be learnt by combining unsupervised learning in the lower layer with supervised learning in the top layer. Similar





















results are also seen for $N = 3$, where 104 Boolean functions, out of a total of 256, are learnable in a shallow network, and all 256 functions are learnable by a two-layer network by any of the three learning rules.
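The shallow counts of 14 and 104 coincide with the number of Boolean functions that a single threshold gate can represent at all, which is an upper bound on what any rule can learn in a single unit. The brute-force sketch below counts these functions for small fan-in. It does not simulate any learning rule; the function name and the integer search range for weights and bias are assumptions, chosen large enough for $N \le 3$.

```python
from itertools import product

def count_threshold_functions(n, weight_range=3):
    """Count Boolean functions of n +/-1 inputs realizable by one threshold gate.

    Brute force over small integer weights and biases; the search range is an
    assumption that is adequate for n <= 3.
    """
    inputs = list(product([-1, 1], repeat=n))
    values = range(-weight_range, weight_range + 1)
    found = set()
    for params in product(values, repeat=n + 1):
        bias, weights = params[0], params[1:]
        sums = [bias + sum(w * x for w, x in zip(weights, xs)) for xs in inputs]
        if 0 in sums:          # skip ambiguous cases where the activation is zero
            continue
        found.add(tuple(1 if s > 0 else -1 for s in sums))
    return len(found)

shallow_counts = {n: count_threshold_functions(n) for n in (2, 3)}
```

The resulting dictionary maps fan-in 2 to 14 and fan-in 3 to 104, in agreement with the shallow column of Table 6. For larger fan-in this enumeration becomes impractical, and the integer search range would have to be enlarged.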

Table 7 shows similar results on the subset of monotone Boolean functions. As a reminder, a Boolean function is said to be monotone if increasing the total number of +1 in the input vector can only leave the value of the output unchanged or increase its value from -1 to +1. Equivalently, it is the set of Boolean functions with a circuit comprising only AND and OR gates. There are recursive methods for generating monotone Boolean functions and the total number of monotone Boolean functions is known as the Dedekind number. For instance, there are 168 monotone Boolean functions with $N = 4$ inputs. Of these, 150 are learnable by a single unit trained in supervised fashion, and all 168 are learnable by a two-layer network trained with a combination of unsupervised (input layer) and supervised (output layer) application of the three local rules.


| Fan In | Functions Learnt (Shallow) | Functions Learnt (Deep) | Total Number of Functions | Rule |
|---|---|---|---|---|
| 2 | 14 | 16 | 16 | Simple Hebb |
| 2 | 14 | 16 | 16 | Oja |
| 2 | 14 | 16 | 16 | New |
| 3 | 104 | 256 | 256 | Simple Hebb |
| 3 | 104 | 256 | 256 | Oja |
| 3 | 104 | 256 | 256 | New |

Table 6: Small fan-in Boolean functions learnt by deep local learning.

| Fan In | Functions Learnt (Shallow) | Functions Learnt (Deep) | Total Number of Functions | Rule |
|---|---|---|---|---|
| 2 | 6 | 6 | 6 | Simple Hebb |
| 2 | 6 | 6 | 6 | Oja |
| 2 | 6 | 6 | 6 | New |
| 3 | 20 | 20 | 20 | Simple Hebb |
| 3 | 20 | 20 | 20 | Oja |
| 3 | 20 | 20 | 20 | New |
| 4 | 150 | 168 | 168 | Simple Hebb |
| 4 | 150 | 168 | 168 | Oja |
| 4 | 150 | 168 | 168 | New |

Table 7: Small fan-in monotone Boolean functions learnt by deep local learning.

In combination, these simulations raise the question of what are the classes of functions learnable by shallow or 
deep local learning, and raise the (false) hope that purely local learning may be able to replace backpropagation. 

6.2 Learnability in Shallow Networks 

Here we consider in more detail the learning problem for a single [-1,1] threshold gate, or perceptron.

6.2.1 Perceptron Rule

In this setting, the problem has already been solved at least in one setting by the perceptron learning algorithm and theorem [49, 42]. Obviously, by definition, a threshold gate can only implement in an exact way functions (Boolean or continuous) that are linearly separable. The perceptron learning theorem simply states that if the data is linearly separable, the local gradient descent learning rule $\Delta w_i = \eta(T - O)I_i$ will converge to such a separating hyperplane. Note that this is true also in the case of [0,1] gates as the gradient descent rule has the same form in both systems. When the training data is not linearly separable, the perceptron algorithm is still well behaved in the sense that the algorithm
converges to a relatively small compact region |^, 14|, 24]. Here we consider similar results for a slightly different 
supervised rule, the clamped form of the simple Hebb rule: $\Delta w_i = \eta T I_i$.
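For reference, the perceptron rule $\Delta w_i = \eta(T - O)I_i$ can be sketched on a small linearly separable problem. The example below learns the AND function of two $\pm 1$ inputs with a bias; the choice of function, the learning rate, the convention that a zero activation maps to +1, and the epoch limit are all illustrative assumptions.

```python
def threshold(s):
    # [-1, 1] threshold gate; a zero activation is mapped to +1 by convention here.
    return 1 if s >= 0 else -1

# AND of two +/-1 inputs; the leading 1 in each input is the bias input.
inputs = [(1, -1, -1), (1, -1, 1), (1, 1, -1), (1, 1, 1)]
targets = [-1, -1, -1, 1]

eta = 0.5
w = [0.0, 0.0, 0.0]
for epoch in range(100):
    mistakes = 0
    for I, T in zip(inputs, targets):
        O = threshold(sum(wi * xi for wi, xi in zip(w, I)))
        if O != T:
            mistakes += 1
            w = [wi + eta * (T - O) * xi for wi, xi in zip(w, I)]   # dw_i = eta (T - O) I_i
    if mistakes == 0:
        break

predictions = [threshold(sum(wi * xi for wi, xi in zip(w, I))) for I in inputs]
```

Since the update is zero whenever $O = T$, the weights change only on misclassified examples, and the loop stops once a full pass produces no mistakes, at which point `predictions` equals `targets`.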


6.2.2 Supervised Simple Hebb Rule 

Here we consider a supervised training set consisting of input-target pairs of the form $S = \{(I(t), T(t)) : t = 1, \ldots, M\}$ where the input vectors $I(t)$ are $N$-dimensional (not necessarily binary) vectors with corresponding targets $T(t) = \pm 1$ for every $t$ (Figure 6.1). $S$ is linearly separable (with or without bias) if there is a separating hyperplane, i.e. a set of weights $w$ such that $\tau(I(t)) = \tau(\sum_i w_i I_i(t)) = T(t)$ (with or without bias) for every $t$, where $\tau$ is the ±1 threshold function. To slightly simplify the notation and analysis, throughout this section we do not allow ambiguous cases where $\tau(I) = 0$ for any $I$ of interest. In this framework, the linearly separable set $S$ is learnable by a given learning rule $R$ ($R$-learnable) if the rule can find a separating hyperplane.

The Case Without Bias: When there is no bias ($w_0 = 0$), then $\tau(-I) = -\tau(I)$ for every $I$. In this case, a set $S$ is consistent if for every $t_1$ and $t_2$, $I(t_1) = -I(t_2) \Rightarrow T(t_1) = -T(t_2)$. Obviously consistency is a necessary condition for separability and learnability in the case of 0 bias. When the bias is 0, the training set $S$ can be put into its canonical form $S^c$ by ensuring that all targets are set to $+1$, replacing any training pair of the form $(I(t), -1)$ by the equivalent pair $(T(t)I(t), +1) = (-I(t), +1)$. Thus the size of a learnable canonical training set in the binary case, where $I_i(t) = \pm 1$ for every $i$ and $t$, is at most $2^{N-1}$.

We now consider whether $S$ is learnable by the supervised simple Hebb rule (SSH-learnable) corresponding to clamped outputs $\Delta w_i = \eta I_i T$, first in the case where there is no bias, i.e. $w_0 = 0$. We let Cos denote the $M \times M$ symmetric square matrix of cosine values $\mathrm{Cos} = (\mathrm{Cos}_{uv}) = (\cos(T(u)I(u), T(v)I(v))) = (\cos(I^c(u), I^c(v)))$. It is easy to see that applying the supervised simple Hebb rule with the vectors in $S$ is equivalent to applying the supervised simple Hebb rule with the vectors in $S^c$, both leading to the same weights. If $S$ is in canonical form and there is no bias, we have the following properties.

Theorem: 

1. The supervised simple Hebb rule leads to $\Delta w_i = \eta E(I_i T) = \eta E(I_i^c) = \eta \mu_i^c$ and thus $w(k) = w(0) + \eta k \mu^c$.

2. A necessary condition for $S$ to be SSH-learnable is that $S$ (and equivalently $S^c$) be linearly separable by a hyperplane going through the origin.

3. A sufficient condition for $S$ to be SSH-learnable from any set of starting weights is that all the vectors in $S^c$ be in a common orthant, i.e. that the angle between any $I^c(u)$ and $I^c(v)$ lie between 0 and $\pi/2$ or, equivalently, that $0 \le \cos(I^c(u), I^c(v)) \le 1$ for any $u$ and $v$.

4. A sufficient condition for $S$ to be SSH-learnable from any set of starting weights is that all the vectors in $S$ (or equivalently in $S^c$) be orthogonal to each other, i.e. $I(u) \cdot I(v) = 0$ for any $u \ne v$.

5. If all the vectors $I(t)$ have the same length, in particular in the binary ±1 case, $S$ is SSH-learnable from any set of initial weights if and only if the sum of any row or column of the cosine matrix associated with $S^c$ is strictly positive.

Proof: 

1) Since $S^c$ is in canonical form, all the targets are equal to $+1$, and thus $E(I_i(t)T(t)) = E(I_i^c(t)) = \mu_i^c$. After $k$ learning epochs, with a constant learning rate, the weight vector is given by $w(k) = w(0) + \eta k \mu^c$.

2) This is obvious since the unit is a threshold gate. 

3) For any $u$, the vector $I^c(u)$ has been learnt after $k$ epochs if and only if

$$\sum_{i=1}^{N} \left[ w_i(0) + \eta k E(I_i(t)T(t)) \right] I_i^c(u) = \sum_{i=1}^{N} \left( w_i(0) I_i^c(u) + \eta k \frac{1}{M} \sum_{t=1}^{M} I_i^c(t) I_i^c(u) \right) > 0 \quad (55)$$

Here we assume a constant positive learning rate, so after a sufficient number of epochs the effect of the initial conditions on this inequality can be ignored. Alternatively one can examine the regime of decreasing learning rates using initial conditions close to 0. Thus ignoring the transient effect caused by the initial conditions, and separating the terms corresponding to $u$, $I^c(u)$ will be learnt after a sufficient number of epochs if and only if

$$\sum_{i=1}^{N} \sum_{t=1}^{M} I_i^c(t) I_i^c(u) = \sum_{t=1}^{M} I^c(t) \cdot I^c(u) = \sum_{t=1}^{M} \|I^c(t)\| \, \|I^c(u)\| \cos(I^c(t), I^c(u)) = \|I^c(u)\|^2 + \sum_{t \ne u} \|I^c(t)\| \, \|I^c(u)\| \cos(I^c(t), I^c(u)) > 0 \quad (56)$$

Thus if all the cosines are between 0 and 1 this sum is strictly positive (note that we do not allow $I(u) = 0$ in the training set). Since the training set is finite, we simply take the maximum number of epochs over all training examples where this inequality is satisfied, to offset the initial conditions. Note that the expression in Equation 56 is invariant with respect to any transformation that preserves vector lengths and angles, or changes the sign of all or some of the angles. Thus it is invariant with respect to any rotations, or symmetries.

4) This is a special case of 3, also obvious from Equation 56. Note in particular that a set of $\alpha N$ ($0 < \alpha < 1$) vectors chosen randomly (e.g. uniformly over the sphere or with fair coin flips) will be essentially orthogonal and thus learnable with high probability when $N$ is large.

5) If all the training vectors have the same length $A$ (with $A = \sqrt{N}$ in the binary case), Equation 56 simply becomes

$$A^2 \sum_{t=1}^{M} \cos(I^c(u), I^c(t)) > 0 \quad (57)$$

and the property is then obvious. Note that it is easy to construct counterexamples where this property is not true if the training vectors do not have the same length. Take, for instance, $S = \{(I(1), +1), (I(2), +1)\}$ with $I(1) = (1, 0, 0, \ldots, 0)$ and $I(2) = (-\epsilon, 0, 0, \ldots, 0)$ for some small $\epsilon > 0$.
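The learnability criterion of Equation 56 and the cosine row sums of Property 5 can be checked numerically. The sketch below is illustrative only: the function names are ours, the initial weights are taken to be zero, and the second dataset is a hypothetical unequal-length example of our own construction (it is not the one given in the text), chosen so that the cosine row sums are positive while the Equation 56 criterion fails.

```python
import numpy as np

def canonical(I, T):
    """Canonical form: multiply each input by its +/-1 target so every target becomes +1."""
    return I * T[:, None]

def ssh_learnable(I, T):
    """Equation 56 test: every row sum of the Gram matrix of canonical vectors is > 0."""
    Ic = canonical(I, T)
    return bool(np.all((Ic @ Ic.T).sum(axis=1) > 0))

def cosine_row_sums(I, T):
    """Row sums of the cosine matrix of the canonical vectors (Property 5)."""
    Ic = canonical(I, T)
    U = Ic / np.linalg.norm(Ic, axis=1, keepdims=True)
    return (U @ U.T).sum(axis=1)

# Orthogonal, equal-length vectors with arbitrary +/-1 targets: learnable.
I_orth = np.sqrt(10.0) * np.eye(10)
T_orth = np.array([1, -1, 1, 1, -1, -1, 1, -1, 1, 1], dtype=float)
w = 0.1 * 50 * canonical(I_orth, T_orth).mean(axis=0)   # w(k) = eta * k * mu^c with w(0) = 0
print(ssh_learnable(I_orth, T_orth), bool(np.all(np.sign(I_orth @ w) == T_orth)))

# Hypothetical unequal-length example: cosine row sums are positive, yet the long
# vector dominates the mean and the short vector ends up on the wrong side.
I_bad = np.array([[10.0, 0.0], [-1.0, 1.0]])
T_bad = np.array([1.0, 1.0])
print(cosine_row_sums(I_bad, T_bad), ssh_learnable(I_bad, T_bad))
```

In the second dataset the mean canonical vector is (4.5, 0.5), whose inner product with (-1, 1) is negative, even though the set is separable through the origin (for example by w = (1, 5)).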

The Case With Adaptive Bias: When there is a bias ($w_0$ is not necessarily 0), starting from the training set $S = \{(I(t), T(t))\}$ we first modify each vector $I(t)$ into a vector $I'(t)$ by adding a zero-th component equal to $+1$, so that $I'_0(t) = +1$, and $I'_i(t) = I_i(t)$ otherwise. Finally, we construct the corresponding canonical set $S^c$ as in the case of 0 bias by letting $S^c = \{(I^c(t), +1)\} = \{(T(t)I'(t), +1)\}$ and apply the previous results to $S^c$. It is easy to check that applying the supervised simple Hebb rule with the vectors in $S$ is equivalent to applying the supervised simple Hebb rule with the vectors in $S^c$, both leading to the same weights.

Theorem: 

1. The supervised simple Hebb rule applied to $S^c$ leads to $\Delta w_i = \eta E(I'_i T) = \eta E(I_i^c) = \eta \mu_i^c$ and thus $w(k) = w(0) + \eta k \mu^c$. The component $\mu_0^c$ is equal to $E(T)$, i.e. the proportion of vectors in $S$ with a target equal to $+1$ minus the proportion with a target equal to $-1$.

2. A necessary condition for $S$ to be SSH-learnable is that $S^c$ be linearly separable by a hyperplane going through the origin in $N + 1$ dimensional space.

3. A sufficient condition for $S$ to be SSH-learnable from any set of starting weights is that all the vectors in $S^c$ be in a common orthant, i.e. that the angle between any $I^c(u)$ and $I^c(v)$ lie between 0 and $\pi/2$ or, equivalently, that $0 \le \cos(I^c(u), I^c(v)) \le 1$ for any $u$ and $v$.

4. A sufficient condition for $S$ to be SSH-learnable from any set of starting weights is that all the vectors in $S^c$ be orthogonal to each other, i.e. $I^c(u) \cdot I^c(v) = 0$ for any $u \ne v$.

5. If all the vectors $I^c(t)$ have the same length, in particular in the binary ±1 case, $S$ is SSH-learnable from any set of initial weights if and only if the sum of any row or column of the cosine matrix $\mathrm{Cos} = (\cos(I^c(u), I^c(v)))$ is strictly positive.

Proof: 

The proofs are the same as above. Note that $\mu_0^c = E(I_0^c(t)) = E(I'_0(t)T(t)) = E(T(t))$.

Figure 6.1: Examples of supervised simple Hebb learning with different training set properties. The linearly separable data is a random matrix of binary -1,+1 values (p=0.5) of shape M=10, N=10, with binary -1,+1 targets determined by a random hyperplane. The orthogonal dataset is simply the identity matrix (multiplied by the scalar $\sqrt{10}$) and random binary -1,+1 targets. The common orthant dataset was created by sampling the features in each column from either [-1,0) or (0,1], then setting all the targets to +1. The weights were initialized independently from N(0,1), and weights were updated with the learning rate $\eta = 0.1$.
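For the case with adaptive bias, the construction of the canonical set can be sketched as follows. The function name and the random data are illustrative assumptions; the only point being shown is that the zero-th component of the mean canonical vector equals the mean of the targets.

```python
import numpy as np

def canonical_with_bias(I, T):
    """Prepend a constant +1 component, then multiply each vector by its +/-1 target."""
    I_aug = np.hstack([np.ones((I.shape[0], 1)), I])
    return I_aug * T[:, None]

rng = np.random.default_rng(0)
I = rng.choice([-1.0, 1.0], size=(10, 10))   # hypothetical binary inputs, M = N = 10
T = rng.choice([-1.0, 1.0], size=10)         # hypothetical +/-1 targets

Ic = canonical_with_bias(I, T)
mu = Ic.mean(axis=0)        # SSH drift direction: w(k) = w(0) + eta * k * mu
print(mu[0], T.mean())      # the bias component equals the mean target
```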

6.3 Limitations of Shallow Local Learning 

In summary, strictly local learning in a single threshold gate or sigmoidal function can learn any linearly separable function. While it is as powerful as the unit allows it to be, this form of learning is limited in the sense that it can learn only a very small fraction of all possible functions. This is because the logarithm of the size of the set of all possible Boolean functions of $N$ variables is exponential and equal to $2^N$, whereas the logarithm of the total number of linearly separable Boolean functions scales polynomially like $N^2$. Indeed, the total number $T_N$ of threshold functions of $N$ variables satisfies

$$N(N-1)/2 < \log_2 T_N < N^2 \quad (58)$$


(see 116 IL ll8L I43L 1^ and references therein). The same negative result holds also for the more restricted class of 
monotone Boolean functions, or any other class of exponential size. Most monotone Boolean functions cannot be 
learnt by a single linear threshold unit because the number $M_N$ of monotone Boolean functions of $N$ variables, known as the Dedekind number, satisfies [31]

$$\log_2 M_N = \binom{N}{\lfloor N/2 \rfloor} \left(1 + O(\log N / N)\right) \quad (59)$$

These results are immediately true also for polynomial threshold functions, where the polynomials have bounded 
degree, by similar counting arguments ijsl]. In short, linear or bounded-polynomial threshold functions can at best 
learn a vanishingly small fraction of all Boolean functions, or any subclass of exponential size, regardless of the 
learning rule used for learning. 
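To give a feel for the orders of magnitude in this counting argument, the short sketch below compares, for a few values of N, the exponent $2^N$ for all Boolean functions with the polynomial bound $N^2$ for threshold functions and with the leading term $\binom{N}{\lfloor N/2 \rfloor}$ for monotone functions. It assumes the upper bound $\log_2 T_N < N^2$ and keeps only the leading term of the monotone count; the function name is ours.

```python
import math

def log2_fraction_upper_bound(N):
    """log2 of an upper bound on (#threshold functions) / (#Boolean functions)."""
    return N ** 2 - 2 ** N      # uses log2 T_N < N^2 and log2(#Boolean functions) = 2^N

for N in (5, 8, 16):
    # columns: N, log2 #Boolean, bound on log2 #threshold, leading term of log2 #monotone
    print(N, 2 ** N, N ** 2, math.comb(N, N // 2), log2_fraction_upper_bound(N))
```

Already at N = 8 the fraction of Boolean functions that are linearly separable is below $2^{-192}$, while the monotone class remains far larger than the class of threshold functions as N grows.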

The fact that local learning in shallow networks has significant limitations seems to be a consequence of the limitations of shallow networks, which are simply not able to implement complex functions. This alone does not preclude the possibility that iterated shallow learning applied to deep architectures, i.e. deep local learning, may be able to learn complex functions. After all, this would be consistent with what is observed in the simple simulations described above where the XOR function, which is not learnable by a shallow network, becomes learnable by local rules in a network of depth two. Thus over the years many attempts have been made to seek efficient, and perhaps more biologically plausible, alternatives to backpropagation for learning complex data using only local rules. For example, in one of the simplest cases, one could try to learn a simple two-layer autoencoder using unsupervised local learning in the first layer and supervised local learning in the top layer. More broadly, one could for example try to learn the MNIST benchmark BSll data using purely local learning. Simulations show (data not shown) however that such schemes fail regardless of which local learning rules are used, how the learning rates and other hyperparameters are tuned, and so forth. In the next section we show why all the attempts that have been made in this direction are bound to fail.


6.4 Limitations of Deep Local Learning 

Consider now deep local learning in a deep layered feedforward architecture (Figure 6.2) with $L + 1$ layers of size $N_0, N_1, \ldots, N_L$ where layer 0 is the input layer, and layer $L$ is the output layer. We let $O_i^h$ denote the activity of unit $i$ in layer $h$ with $O_i^h = f(S_i^h) = f(\sum_j w_{ij}^h O_j^{h-1})$. The non-linear processing units can be fairly arbitrary. For this section, it will be sufficient to assume that the functions $f$ be differentiable functions of their synaptic weights and inputs. It is also possible to extend the analysis to, for instance, threshold gates by taking the limit of very steep differentiable sigmoidal functions. We consider the supervised learning framework with a training set of input-output vector pairs of the form $(I(t), T(t))$ for $t = 1, \ldots, M$ and the goal is to minimize a differentiable error function $E_{err}$. The main learning constraint is that we can only use deep local learning (Figure 6.2).

Figure 6.2: Deep local learning. Local learning rules are used for each unit. For all the hidden units, the local learning rules are unsupervised and thus of the form $\Delta w_{ij}^h = F(O_i^h, O_j^{h-1}, w_{ij}^h)$. For all the output units, the local learning rules can be supervised since the targets are considered as local variables and thus of the form $\Delta w_{ij}^L = F(T_i, O_i^L, O_j^{L-1}, w_{ij}^L)$.


Fact: Consider the supervised learning problem in a deep feedforward architecture with differentiable error function 
and transfer functions. Then in most cases deep local learning cannot find weights associated with critical points of 
the error functions, and thus it cannot find locally or globally optimal weights. 

Proof: If we consider any weight $w_{ij}^h$ in a deep layer $h$ (i.e. $0 < h < L$), a simple application of the chain rule (or the backpropagation equations) shows that

$$\frac{\partial E_{err}}{\partial w_{ij}^h} = \sum_{t=1}^{M} B_i^h(t) \, O_j^{h-1}(t) \quad (60)$$


where $B_i^h(t)$ is the backpropagated error of unit $i$ in layer $h$, which depends in particular on the targets $T(t)$ and the weights in the layers above layer $h$. Likewise, $O_j^{h-1}(t)$ is the presynaptic activity of unit $j$ in layer $h - 1$ which depends on the inputs $I(t)$ and the weights in the layers below layer $h - 1$. In short, the gradient is a sum over all training examples of product terms, each product term being the product of a target-dependent term with an input-dependent term. [The target-dependent term depends explicitly also on all the descendant weights of unit $i$ in layer $h$, and the input-dependent term depends also on all the ancestor weights of unit $j$ in layer $h - 1$.] As a result, in most cases, the deep weights $w_{ij}^h$, which correspond to a critical point where $\partial E_{err} / \partial w_{ij}^h = 0$, must depend on both the inputs and the targets, as well as all the other weights. In particular, this must be true at any local or global optimum. However, using any strictly local learning scheme all the deep weights $w_{ij}^h$ ($h < L$) depend on the inputs only, and thus cannot correspond to a critical point.
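The structure of the gradient used in this proof can be illustrated numerically. The sketch below builds a small two-layer tanh network with quadratic error (sizes, transfer function, and error function are our assumptions), computes the gradient of the deep weights as a sum over examples of a backpropagated, target-dependent term times an input-dependent term, checks one entry against a finite difference, and confirms that changing only the targets changes the gradient of the deep weights.

```python
import numpy as np

rng = np.random.default_rng(0)
M, n0, n1, n2 = 5, 4, 3, 2
I = rng.normal(size=(M, n0))       # inputs I(t), one row per example
T = rng.normal(size=(M, n2))       # targets T(t)
W1 = rng.normal(size=(n1, n0))     # deep layer weights (h = 1)
W2 = rng.normal(size=(n2, n1))     # top layer weights (h = L = 2)

def error(W1, W2, I, T):
    O1 = np.tanh(I @ W1.T)
    O2 = np.tanh(O1 @ W2.T)
    return 0.5 * np.sum((O2 - T) ** 2)

def deep_gradient(W1, W2, I, T):
    """dE/dW1 as a sum over examples of B_i(t) * O_j(t), in the form of Equation 60."""
    O1 = np.tanh(I @ W1.T)
    O2 = np.tanh(O1 @ W2.T)
    B2 = (O2 - T) * (1.0 - O2 ** 2)    # dE/dS at the output: depends on the targets
    B1 = (B2 @ W2) * (1.0 - O1 ** 2)   # backpropagated to layer 1 through W2
    return B1.T @ I                     # sum over t of B1_i(t) * I_j(t)

G = deep_gradient(W1, W2, I, T)

# Central finite-difference check on one deep weight.
eps = 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
numeric = (error(Wp, W2, I, T) - error(Wm, W2, I, T)) / (2 * eps)
print(G[0, 0], numeric)

# Changing only the targets changes the gradient of the deep weights.
print(np.allclose(G, deep_gradient(W1, W2, I, -T)))
```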

In particular, this shows that applying local Hebbian learning to a feedforward architecture, whether a simple autoencoder architecture or Fukushima’s complex neocognitron architecture, cannot achieve optimal weights, regardless of which kind of local Hebbian rule is being used. For the same reasons, an architecture consisting of a stack of autoencoders trained using unlabeled data only [27, 28, 12, 11, 20] cannot be optimal in general, even when the top layer is trained by gradient descent. It is of course possible to use local learning, shallow or deep autoencoders, Restricted Boltzmann Machines, and so forth to compress data, or to initialize the weights of a deep architecture. However, these steps alone cannot learn complex functions optimally because learning a complex function optimally necessitates the reverse propagation of information from the targets back to the deep layers.

The Fact above is correct at a level that would satisfy a physicist and is consistent with empirical evidence. It 
is not completely tight from a mathematical standpoint due to the phrase “in most cases”. This expression is meant 
to exclude trivial cases that are not important in practice, but which would be difficult to capture exhaustively with 
mathematical precision. These include the case when the training data is trivial with respect to the architecture (e.g. 
M = 1) and can be loaded entirely in the weights of the top layer, even with random weights in the lower layers, or 
when the data is generated precisely with an artificially constructed architecture where the deep weights depend only 
on the input data, or are selected at random. 

This simple result has significant consequences. In particular, if a constrained feedforward architecture is to be 
trained on a complex task in some optimal way, the deep weights of the architecture must depend on both the training 
inputs and the target outputs. Thus in any physical implementation, in order to be able to reach a locally optimal 
architecture there must exist a physical learning channel that conveys information about the targets back to the deep 
weights. This raises three sets of questions regarding: (1) the nature of the backward learning channel; (2) the nature of the information being transmitted through this channel; and (3) the rate of the backward learning channel. These questions will be addressed in Section 8. We now focus on the information about the targets that is being transmitted to the deep layers.


7 Local Deep Learning and Deep Targets Algorithms 

7.1 Definitions and their Equivalence 

We have seen in the previous section that in general in an optimal implementation each weight $w_{ij}^h$ must depend on both the inputs $I$ and the targets $T$. In order for learning to remain local, we let $I_{ij}^h(T)$ denote the information about the targets that is transmitted from the output layer to the weight $w_{ij}^h$ for its update by a corresponding local learning rule of the form

$$\Delta w_{ij}^h = F(I_{ij}^h, O_i^h, O_j^{h-1}, w_{ij}^h) \quad (61)$$

[The upper and lower indexes on $I$ distinguish it clearly from the inputs in the 0-th layer.] We call this local deep learning (Figure 7.1) in contrast with deep local learning. The main point of the previous section was to show that local deep learning is more powerful than deep local learning, and local deep learning is necessary for reaching optimal weights.

Figure 7.1: Local deep learning. In general, deep local learning cannot learn complex functions optimally since it leads to architectures where only the weights in the top layer depend on the targets. For optimal learning, some information $I_{ij}^h(T)$ about the targets must be transmitted to each synapse associated with any deep layer $h$, so that it becomes a local variable that can be incorporated into the corresponding local learning rule $\Delta w_{ij}^h = F(I_{ij}^h, O_i^h, O_j^{h-1}, w_{ij}^h)$.


Definition 1: Within the class of local deep learning algorithms, we define the subclass of deep targets local learning algorithms as those for which the information transmitted about the targets depends only on the postsynaptic unit, in other words $I_{ij}^h(T) = I_i^h(T)$. Thus in a deep targets learning algorithm we have

$$\Delta w_{ij}^h = F(I_i^h, O_i^h, O_j^{h-1}, w_{ij}^h) \quad (62)$$

for some function $F$ (Figure 7.2).

We have also seen that when proper targets are available, there are efficient local learning rules for adapting the weights of a unit. In particular, the rule $\Delta w = \eta (T - O) I$ works well in practice for both sigmoidal and threshold transfer functions. Thus the deep learning problem can in principle be solved by providing good targets for the deep layers. We can introduce a second definition of deep targets algorithms:

Definition 2: A learning algorithm is a deep targets learning algorithm if it provides targets for all the trainable units. 

Theorem: Definition 1 is equivalent to Definition 2. Furthermore, backpropagation can be viewed as a deep targets 
algorithm. 

Proof: Starting from Definition 2, if some target $T_i^h$ is available for unit $i$ in layer $h$, then we can set $I_i^h = T_i^h$ in Definition 1. Conversely, starting from Definition 1, consider a deep targets algorithm of the form

$$\Delta w_{ij}^h = F(I_i^h, O_i^h, O_j^{h-1}, w_{ij}^h) \quad (63)$$

If we had a corresponding target $T_i^h$ for this unit, it would be able to learn by gradient descent in the form

$$\Delta w_{ij}^h = \eta (T_i^h - O_i^h) O_j^{h-1} \quad (64)$$

This is true both for sigmoidal transfer functions and for threshold gates, otherwise the rule should be slightly modified to accommodate other transfer functions accordingly. By combining Equations 63 and 64 we can solve for the target

$$T_i^h = \frac{F(I_i^h, O_i^h, O_j^{h-1}, w_{ij}^h)}{\eta \, O_j^{h-1}} + O_i^h \quad (65)$$

assuming the presynaptic activity $O_j^{h-1} \ne 0$ (note that $T_i^L = T_i$) (Figure 7.2). In particular, we see that backpropagation can be viewed as a deep targets algorithm providing targets for the hidden layers according to Equation 65 in the form:

$$T_i^h = I_i^h + O_i^h \quad (66)$$

where $I_i^h = B_i^h = \partial E_{err} / \partial S_i^h$ is exactly the backpropagated error.

Figure 7.2: Deep targets learning. This is a special case of local deep learning, where the transmitted information $I_{ij}^h(T)$ about the targets does not depend on the presynaptic unit: $I_{ij}^h(T) = I_i^h(T)$. It can be shown that this is equivalent to transmitting a deep target $T_i^h$ for training any unit $i$ in any deep layer $h$ by a local supervised rule of the form $\Delta w_{ij}^h = F(T_i^h, O_i^h, O_j^{h-1}, w_{ij}^h)$. In typical cases (linear, threshold, or sigmoidal units), this rule is $\Delta w_{ij}^h = \eta (T_i^h - O_i^h) O_j^{h-1}$.
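The view of backpropagation as a deep targets algorithm can be sketched for a single example in a small tanh network. The network sizes, the quadratic error, and the variable names are our assumptions. We also make the sign convention explicit: the transmitted signal is taken to be the negative of the derivative of the error with respect to the hidden activation, so that the delta rule of Equation 64 applied to the resulting deep target performs a descent step.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=4)              # presynaptic activities O^{h-1} (here the input)
W1 = rng.normal(size=(3, 4))        # weights of the deep layer h
W2 = rng.normal(size=(2, 3))        # weights of the top layer
t = rng.normal(size=2)              # target for the top layer
eta = 0.1

O1 = np.tanh(W1 @ x)
O2 = np.tanh(W2 @ O1)
dE_dS2 = (O2 - t) * (1.0 - O2 ** 2)          # quadratic error E = 0.5 * ||O2 - t||^2
dE_dS1 = (W2.T @ dE_dS2) * (1.0 - O1 ** 2)   # backpropagated to the deep layer

signal = -dE_dS1                     # transmitted information (descent sign, our assumption)
deep_target = O1 + signal            # deep target: T^h = I^h + O^h
delta_update = eta * np.outer(deep_target - O1, x)    # delta rule of Equation 64
backprop_update = -eta * np.outer(dE_dS1, x)          # gradient descent step on W1

print(np.allclose(delta_update, backprop_update))
```

With this convention, training the deep layer toward `deep_target` with the delta rule reproduces the gradient descent update for that layer.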


7.2 Deep Targets Algorithms: the Sampling Approach 

In the search for alternatives to backpropagation, one can thus investigate whether there exist alternative deep targets algorithms |l^. More generally, deep targets algorithms rely on two key assumptions: (1) the availability of an algorithm $\Theta$ for optimizing any layer or unit, while holding the rest of the architecture fixed, once a target is provided; and (2) the availability of an algorithm for providing deep targets. The maximization by $\Theta$ may be complete or partial, this optimization taking place with respect to an error measure that can be specific to layer $h$ in the architecture (or even specific to a subset of units in the case of an architecture where different units are found in the same layer). For instance, an exact optimization algorithm $\Theta$ is obvious in the unconstrained Boolean case flSl-. For a layer of threshold gates, $\Theta$ can be the perceptron algorithm, which is exact in the linearly separable case. For a layer of artificial neurons with differentiable transfer functions, $\Theta$ can be the delta rule or gradient descent, which in general perform only partial optimization. Thus deep targets algorithms proceed according to two loops: an outer loop and an inner loop. The inner loop is used to find suitable targets. The outer loop uses these targets to optimize the weights, as it cycles through the units and the layers of the architecture.

Figure 7.3: Deep architecture and deep targets algorithm. The algorithm visits the various layers according to some schedule and optimizes each one of them. This is achieved by the deep targets algorithm which is capable of providing suitable targets ($T^h$) for any layer $h$ and any input $I$, assuming that the rest of the architecture is fixed. The targets are used to modify the weights associated with the function $B_h$. The targets can be found using a sampling strategy: sampling the activities in layer $h$, propagating them forward to the output layer, selecting the best output sample, and backtracking it to the best sample in layer $h$.

7.2.1 Outer Loop 

The outer loop is used to cycle through, and progressively modify, the weights of a deep feedforward architecture:

1. Cycle through the layers and possibly the individual units in each layer according to some schedule. Examples 
of relevant schedules include successively sweeping through the architecture layer by layer from the first layer 
to the top layer. 

2. During the cycling process, for a given layer or unit, identify suitable targets, while holding the rest of the 
architecture fixed. 

3. Use the algorithm $\Theta$ to optimize the corresponding weights.

Step 2 is addressed by the following inner loop. 

7.2.2 Inner Loop: the Sampling Approach 

The key question of course is whether one can find ways for identifying deep targets $T_i^h$, other than backpropagation,
which is available only in differentiable networks. It is possible to identify targets by using a sampling strategy in 
both differentiable and non-differentiable networks. 

In the online layered version, consider an input vector $I = I(t)$ and its target $T = T(t)$ and any adaptive layer $h$, with $1 \le h \le L$. We can write the overall input-output function $W$ as $W = A_{h+1} B_h C_{h-1}$ (Figure 7.3). We assume that both $A_{h+1}$ and $C_{h-1}$ are fixed. The input $I$ produces an activation vector $O^h$ and our goal is to find a suitable vector target $T^h$ for layer $h$. For this we generate a sample $\mathcal{S}^h$ of activity vectors $S^h$ in layer $h$. This sampling can be carried out in different ways, for instance: (1) by sampling the values $O^h$ over the training set; (2) by small random perturbations, i.e. using random vectors sampled in the proximity of the vector $O^h = B_h C_{h-1}(I)$; (3) by large random perturbation (e.g. in the case of logistic transfer functions by tossing dice with probabilities equal to the activations) or by sampling uniformly; and (4) exhaustively (e.g. in the case of a short binary layer). Finally, each sample $S^h$ can be propagated forward to the output layer and produce a corresponding output $A_{h+1}(S^h)$. We then select as the target vector $T^h$ the sample that produces the output closest to the true target $T$. Thus

$$T^h = \arg\min_{S^h \in \mathcal{S}^h} \Delta^L(T, A_{h+1}(S^h)) \quad (67)$$

If there are several optimal vectors in $\mathcal{S}^h$, then one can select one of them at random, or use $\Delta^h$ to control the size of the learning step. For instance, by selecting a vector $S^h$ that not only minimizes the output error $\Delta^L$ but also minimizes the error $\Delta^h(S^h, O^h)$, one can ensure that the target vector is as close as possible to the current layer activity, and hence minimizes the corresponding perturbation. As with other learning and optimization algorithms, these algorithmic details can be varied during training, for instance by progressively reducing the size of the learning steps as learning progresses (see Appendix D for additional remarks on deep targets algorithms). Note that the algorithm described above is the natural generalization of the algorithm introduced in |01 for the unrestricted Boolean autoencoder, specifically for the optimization of the lower layer. Related reinforcement learning [52] algorithms for connectionist networks of stochastic units can be found in ohl].
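The inner loop described above can be sketched in a few lines. This is a minimal illustration under our own assumptions: the upper part of the architecture $A_{h+1}$ is a single layer of ±1 threshold gates with hypothetical random weights `W_top`, the distortion is the Hamming distance, the layer is short enough for exhaustive sampling, and ties are broken by taking the first minimizer rather than at random.

```python
import itertools
import numpy as np

def sample_deep_target(upper, target, samples):
    """Return the sample whose image under `upper` is closest (Hamming) to `target`."""
    errors = [int(np.sum(upper(s) != target)) for s in samples]
    return samples[int(np.argmin(errors))]      # ties: first minimizer

rng = np.random.default_rng(0)
W_top = rng.normal(size=(6, 3))                          # hypothetical fixed upper weights
upper = lambda s: np.where(W_top @ s > 0, 1, -1)         # A_{h+1}: threshold gates

# Exhaustive sampling of a short binary layer (3 units, 8 candidate vectors).
samples = np.array(list(itertools.product([-1, 1], repeat=3)))
target = upper(samples[5])                               # an output target that is reachable
T_h = sample_deep_target(upper, target, samples)
print(T_h, int(np.sum(upper(T_h) != target)))
```

The selected vector `T_h` would then serve as the target for the units of layer h in the outer loop, for instance with the perceptron rule for threshold gates.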


7.3 Simulation 


Here we present the result of a simulation to show that sampling deep target algorithms can work and can even be applied to the case of non-differentiable networks where backpropagation cannot be applied directly. A different application of the deep targets idea is developed in s. We use a four-adjustable-layer perceptron autoencoder with threshold gate units and Hamming distance error in all the layers. The input and output layers have $N_0 = N_4 = 100$ units each, and there are three hidden layers with $N_1 = 30$, $N_2 = 10$, and $N_3 = 30$ units. All units in any layer $h$ are fully connected to the units in the layer below, plus a bias term. The weights are initialized randomly from the uniform distribution U(— …) except for the bias terms which are all zero.


The training data consists of 10 clusters of 100 binary examples each for a total of $M = 1000$. The centroid of each cluster is a random 100-bit binary vector with each bit drawn independently from the binomial distribution with $p = 0.5$. An example from a particular cluster is generated by starting from the centroid and introducing noise: each bit has an independent probability 0.05 of being flipped. The test data consists of an additional 100 examples drawn from each of the 10 clusters. The distortion function $\Delta^h$ for all layers is the Hamming distance, and the optimization algorithm $\Theta$ is 10 iterations of the perceptron algorithm with a learning rate of 1. The gradient is calculated in batch mode using all 1000 training examples at once. For the second layer with $N_2 = 10$, we use exhaustive sampling since there are only $2^{10} = 1024$ possible activation values. For other layers where $N_h > 10$, the sample comprises all the 1000 activation vectors of the corresponding layer over the training set, plus a set of 1000 random binary vectors where each bit is independent and 1 with probability 0.5. Updates to the layers are made on a schedule that cycles through the layers in sequential order: 1, 2, 3, 4. One cycle of updates constitutes an epoch. The trajectories of the training and test errors are shown in Figure 7.4, demonstrating that this sampling deep targets algorithm is capable of training this non-differentiable network reasonably well.
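The data generation procedure described above can be sketched as follows. The function name, the seed, and the {0, 1} encoding of the bits are our assumptions (a ±1 encoding would work equally well with threshold gates); the cluster count, cluster size, dimension, and flip probability are those stated in the text.

```python
import numpy as np

def sample_clusters(rng, centroids, per_cluster=100, flip_prob=0.05):
    """Draw noisy copies of each centroid by flipping bits independently."""
    X = np.repeat(centroids, per_cluster, axis=0)
    flips = rng.random(X.shape) < flip_prob
    return np.where(flips, 1 - X, X)

rng = np.random.default_rng(0)
centroids = rng.integers(0, 2, size=(10, 100))    # 10 random 100-bit centroids, p = 0.5
X_train = sample_clusters(rng, centroids)          # M = 1000 training examples
X_test = sample_clusters(rng, centroids)           # 100 additional examples per cluster

# Empirical fraction of flipped bits, which should be close to 0.05.
flip_rate = np.mean(X_train != np.repeat(centroids, 100, axis=0))
print(X_train.shape, X_test.shape, round(float(flip_rate), 3))
```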


8 The Learning Channel and the Optimality of Backpropagation 

Armed with the understanding that in order to implement learning capable of reaching minima of the error function 
there must be a channel conveying information about the targets to the deep weights, we can now examine the three 
key questions about the channel: its nature, its semantics, and its rate. 

8.1 The Nature of the Channel 

In terms of the nature of the channel, regardless of the hardware embodiment, there are two main possibilities. Information about the targets can travel to the deep weights either by: (1) traveling along the physical forward connections but in the reverse direction; or (2) using a separate different channel.

8.1.1 Using the Forward Channel in the Backward Direction 

In essence, this is the implementation that is typically emulated on digital computers using the transpose of the forward matrices in the backpropagation algorithm. However, the first thing to observe is that even when the same channel is being used in the forward and backward direction, the signal itself does not need to be of the same nature. For instance, the forward propagation could be electrical while the backward propagation could be chemical. This is congruent with the observation that the two signals can have very different time scales, with the forward propagation being fast compared to learning, which can occur over longer time scales. In biological neural networks, there is evidence for the existence of complex molecular signaling cascades traveling from the synapses of a neuron to the DNA in its nucleus, capable of activating epigenetic modifications and gene expression, and conversely for molecular signals traveling from the DNA to the synapses. At least in principle, chemical or electrical signals could traverse synaptic clefts in both directions. In short, while there is no direct evidence supporting the use of the same physical connections in both directions in biological neural systems, this possibility cannot be ruled out entirely at this time and conceivably it could be used in other hardware embodiments. It should be noted that a deep targets algorithm in which the feedback reaches the soma of a unit leads to a simpler feedback channel but puts the burden on the unit to propagate the central message from the soma to the synapses.

Figure 7.4: A sampling deep target (DT) algorithm is used to train a simple autoencoder network of threshold gates, thus purely comprised of non-differentiable transfer functions. The y axis corresponds to the average Hamming error per component and per example.


8.1.2 Using a Separate Backward Learning Channel 


If a separate channel is used, the channel must allow the transfer of information from the output layer and the targets 
to the deep weights. This transfer of information could occur through direct connections from the output layer to 
each deep layer, or through a staged process of propagation through each level (as in backpropagation). Obviously combinations of both processes are also possible. In either case, the new channel implements some form of feedback.
Note again that the learning feedback can be slow and distinct in terms of signal, if not also in terms of channel, from 
the usual feedback of recurrent networks that is typically used in rapid dynamic mode to fine tune a rapid response, 
e.g. by helping combine a top down generative model with a bottom up recognition model. 

In biological neuronal circuits, there are plenty of feedback connections between different processing stages (e.g. 112 ih l) and some of these connections could serve as the primary channel for carrying the feedback signal necessary for learning. It must be noted, however, that given a synaptic weight $w_{ij}$, the feedback could typically reach either the dendrites of neuron $i$, or the dendrites of neuron $j$, or both. In general, the dendrites of the presynaptic neuron $j$ are physically far away from the synapse associated with $w_{ij}$, which is located on the dendritic tree of the post-synaptic neuron $i$, raising again a problem of information transmission within neuron $j$ from its incoming synapses on its dendritic tree to the synapses associated with its output at the end of the arborization of its axon. The feedback reaching the dendrites of neuron $i$ could in principle be much closer to the site where $w_{ij}$ is implemented, although this requires a substantial degree of organization of the synapses along the dendrites of neuron $i$, spatially combining synapses originating from feedforward neurons with synapses originating from feedback neurons, together with the biochemical mechanisms required for the local transmission of the feedback information between spatially close synapses. In short, the nature of the feedback channel depends on the physical implementation. While a clear understanding of how biological neural systems implement learning is still out of reach, the framework presented here, in particular the notions of local learning and deep targets algorithms, clarifies some of the aspects and potential challenges associated with the feedback channel and the complex geometry of neurons.

An important related issue is the symmetry of the weights. Backpropagation uses symmetric (transposed)
weights in the forward and backward directions in order to compute exact gradients. In a physical implementation,
especially one that uses different channels for the forward and backward propagation of information, it may
be difficult to instantiate weights that are precisely symmetric. However, simulations [...] seem to indicate that, for
instance, random weights can be used in the backward direction without affecting too much the speed of learning, or
the quality of the solutions. Random weights in general result in matrices that have the maximum rank allowed by
the size of the layers and thus transmit as much information as possible in the reverse direction, at least globally at
the level of entire layers. How this global transmission of information allows precise learning is not entirely clear. But
at least in the simple case of a network with one hidden layer and one output unit, it is easy to give a mathematical
proof that random weights will support convergence and learning, provided the random weights have the same sign
as the forward weights. It is plausible that biological networks could use non-symmetric connections, and that these
connections could possibly be random, or random but with the same sign as the forward connections.
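
The same-sign condition can be explored numerically. The following is a minimal sketch, not the proof alluded to above: it assumes a toy data set, a tanh hidden layer, a linear output unit, and full-batch updates, and the function name `train` and all hyperparameters are hypothetical choices made for illustration. In the `random_same_sign` mode the hidden-layer error signal uses fixed random magnitudes that share only the sign of the forward output weights.

```python
import math
import random

def train(feedback, steps=300, lr=0.02, hidden=8, inputs=4, seed=0):
    """Train a one-hidden-layer tanh network with a single linear output unit.

    feedback="symmetric": the hidden error signal uses the forward output weights (backprop).
    feedback="random_same_sign": it uses fixed random magnitudes carrying the sign of the
    forward output weights (illustrative assumption, see the lead-in).
    Returns (initial_error, final_error) on a fixed toy data set.
    """
    rng = random.Random(seed)
    data = [[rng.uniform(-1, 1) for _ in range(inputs)] for _ in range(32)]
    targets = [math.sin(sum(x)) for x in data]
    A = [[rng.gauss(0, 0.5) for _ in range(inputs)] for _ in range(hidden)]  # input -> hidden
    a = [rng.gauss(0, 0.5) for _ in range(hidden)]                           # hidden -> output
    magnitudes = [rng.uniform(0.1, 1.0) for _ in range(hidden)]              # fixed random sizes

    def forward(x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in A]
        return h, sum(ai * hi for ai, hi in zip(a, h))

    def error():
        return sum((forward(x)[1] - t) ** 2 for x, t in zip(data, targets)) / (2 * len(data))

    initial = error()
    for _ in range(steps):
        grad_a = [0.0] * hidden
        grad_A = [[0.0] * inputs for _ in range(hidden)]
        for x, t in zip(data, targets):
            h, y = forward(x)
            delta = (y - t) / len(data)                 # output error for this example
            for i in range(hidden):
                grad_a[i] += delta * h[i]               # exact gradient for the top weights
                if feedback == "symmetric":
                    back = a[i]
                else:
                    back = math.copysign(magnitudes[i], a[i])
                hidden_delta = back * delta * (1.0 - h[i] ** 2)
                for j in range(inputs):
                    grad_A[i][j] += hidden_delta * x[j]
        for i in range(hidden):
            a[i] -= lr * grad_a[i]
            for j in range(inputs):
                A[i][j] -= lr * grad_A[i][j]
    return initial, error()

if __name__ == "__main__":
    for mode in ("symmetric", "random_same_sign"):
        print(mode, train(mode))
```

In this sketch the sign-matched update equals the true gradient rescaled by a positive factor per hidden unit, so for a small learning rate it remains a descent direction; this is an observation about the toy setting only.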


8.2 The Semantics of the Channel 

Regardless of the nature of the channel, one must next consider the meaning of the information that is being transmitted
to the deep weights, as well as its amount. Whatever information about the targets is fed back, it is ultimately
used within each epoch to change the weights in the form




\[ \Delta w_{ij}^h = \eta F(I_{ij}^h, O_i^h, O_j^{h-1}, w_{ij}^h) \qquad (68) \]




so that with small learning rates a Taylor expansion leads to 


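\[ E_{err}(w_{ij}^h + \Delta w_{ij}^h) = E_{err}(w_{ij}^h) + \sum_{w_{ij}^h} \frac{\partial E_{err}}{\partial w_{ij}^h} \Delta w_{ij}^h + \frac{1}{2} (\Delta w_{ij}^h)^t H (\Delta w_{ij}^h) + R \qquad (69) \]

\[ = E_{err}(w_{ij}^h) + G \cdot (\Delta w_{ij}^h) + \frac{1}{2} (\Delta w_{ij}^h)^t H (\Delta w_{ij}^h) + R \qquad (70) \]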

where G is the gradient, H is the Hessian, and R is the higher order remainder. If we let W denote the total number
of weights in the system, the full Hessian has $W^2$ entries and thus in general is not computable for large W, which is
the case of interest here. Thus limiting the expansion to the first order:

\[ E_{err}(w_{ij}^h + \Delta w_{ij}^h) \approx E_{err}(w_{ij}^h) + G \cdot (\Delta w_{ij}^h) = E_{err}(w_{ij}^h) + \eta \|G\| \, u \cdot g = E_{err}(w_{ij}^h) + \eta \|G\| \mathcal{O} \qquad (71) \]

where u is the unit vector associated with the weight adjustments ($\eta u = (\Delta w_{ij}^h)$), g is the unit vector associated with
the gradient ($g = G/\|G\|$), and $\mathcal{O} = g \cdot u$. Thus to a first order approximation, the information that is sent back
to the deep weights can be interpreted in terms of how well it approximates the gradient G, or how many bits of
information it provides about the gradient. With W weights and a precision level of D bits for any real number, the
gradient contains WD bits of information. These can in turn be split into $D - 1$ bits for the magnitude $\|G\|$ of the
gradient (a single positive real number), and $(W - 1)D + 1$ bits to specify the direction by the corresponding unit
vector ($g = G/\|G\|$), using $W - 1$ real numbers plus one bit to determine the sign of the remaining component. Thus
most of the information of the gradient in a high-dimensional space is contained in its direction. The information
$I_{ij}^h$ determines $\Delta w_{ij}^h$ and thus the main question is how many bits the vector ($\Delta w_{ij}^h$) conveys about G, which is
essentially how many bits the vector u conveys about g, or how close u is to g. With a full budget of B bits per
weight, the gradient can be computed with B bits of precision, which defines a box around the true gradient vector,
or a cone of possible unit directions u. Thus the expectation of $\mathcal{O}$ provides a measure of how well the gradient is
being approximated and how good the corresponding optimization step is (see next section).
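
The bit accounting and the normalized improvement can be made concrete with a short sketch. The function names `gradient_bit_budget` and `normalized_improvement` are hypothetical helpers introduced only for illustration; the formulas are the ones stated in the paragraph above.

```python
import math

def gradient_bit_budget(num_weights, bits_per_real):
    """Split the W*D bits of a gradient into (magnitude_bits, direction_bits)."""
    magnitude_bits = bits_per_real - 1                       # one positive real: no sign bit
    direction_bits = (num_weights - 1) * bits_per_real + 1   # W-1 reals plus one sign bit
    return magnitude_bits, direction_bits

def normalized_improvement(step, gradient):
    """Return u . g, where u and g are the unit vectors along step and gradient."""
    step_norm = math.sqrt(sum(s * s for s in step))
    grad_norm = math.sqrt(sum(g * g for g in gradient))
    dot = sum(s * g for s, g in zip(step, gradient))
    return dot / (step_norm * grad_norm)

if __name__ == "__main__":
    magnitude, direction = gradient_bit_budget(num_weights=1_000_000, bits_per_real=64)
    # Fraction of the gradient's information carried by its direction.
    print(direction / (magnitude + direction))
    print(normalized_improvement([1.0, 0.0], [1.0, 1.0]))
```

For large W the printed fraction is very close to one, which is the sense in which most of the information lies in the direction.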

Conceivably, one can also look at regimes where even more information than the gradient is transmitted through
the backward channel. This could include, for instance, second order information about the curvature of the error
function. However, as mentioned above, in the case of large deep networks this seems problematic since with W
weights, this procedure would scale like $W^2$. Approximations that essentially compute only the diagonal of the
Hessian matrix, and thus only W additional numbers, have been considered [...], using a procedure similar to
backpropagation that scales like W operations. These methods were introduced for other purposes (e.g. network pruning)
and do not seem to have led to significant or practically useful improvements to deep learning methods. Furthermore,
they do not change the essence of the following scaling computations.

8.3 The Rate and other Properties of the Channel 

Here we want to compare several on-line learning algorithms where information about the targets is transmitted back 
to the deep weights and define and compute a notion of transmission rate for the backward channel. We are interested 
in estimating a number of important quantities in the limit of large networks. The estimates do not need to be very
precise; we are primarily interested in expectations and scaling behavior. Here all the estimates are computed for the
adjustment of all the weights on a given training example and thus would have to be multiplied by a factor M for a 
complete epoch. In particular, given a training example, we want to estimate the scaling of: 

• The number $C_W$ of computations required to transmit the backward information per network weight. The
estimates are computed in terms of the number of elementary operations, which are assumed to have a fixed unit
cost. Elementary operations include addition, multiplication, computing the value of a transfer function, and
computing the value of the derivative of the transfer function. We also assume the same costs for the forward
or backward propagation of information. Obviously these assumptions in essence capture the implementation
of neural networks on digital computers but could be revised when considering completely different physical
implementations. With these assumptions, the total number of computations required by a forward pass or a
backpropagation through the network scales like W, and thus $C_W = 1$ for a forward or backward pass.

• The amount of information $\mathcal{I}_W$ that is sent back to each weight. In the case of deep targets algorithms, we
can also consider the amount of information $\mathcal{I}_N$ that is sent back to each hidden unit, from which a value $\mathcal{I}_W$
can also be derived. We let D (for double precision) denote the number of bits used to represent a real number
in a given implementation. Thus, for instance, the backpropagation algorithm provides D bits of information
to each unit and each weight ($\mathcal{I}_W = \mathcal{I}_N = D$) for each training example, associated with the corresponding
derivative.

• We define the rate $\mathcal{R}$ of the backward channel of a learning algorithm by $\mathcal{R} = \mathcal{I}_W / C_W$. It is the number
of bits (about the gradient) transmitted to each weight through the backward channel divided by the number
of operations required to compute/transmit this information per weight. Note that the rate is bounded by D:
$\mathcal{R} \leq D$. This is because the maximal information that can be transmitted is the actual gradient corresponding to
D bits per weight, and the minimal computational/transmission cost must be at least one operation per weight.

• It is also useful to consider the improvement or expected improvement $\mathcal{O}'$, or its normalized version $\mathcal{O}$. All the
algorithms to be considered ultimately lead to a learning step $\eta u$ where $\eta$ is the global learning rate and u is
the vector of weight changes. To a first order of approximation, the corresponding improvement is computed
by taking the dot product with the gradient so that $\mathcal{O}' = \eta u \cdot G$. In the case of (stochastic) gradient descent we
have $\mathcal{O}' = \eta G \cdot G = \eta \|G\|^2$. In gradient descent, the gradient provides both a direction and a magnitude of the
corresponding optimization step. In the perturbation algorithms to be described, the perturbation stochastically
produces a direction but there is no natural notion of magnitude. Since when W is large most of the information
about the gradient is in its direction (and not its magnitude), to compare the various algorithms we can simply
compare the directions of the vectors being produced, in particular in relation to the direction of the gradient.
Thus we will assume that all the algorithms produce a step of the form $\eta u$ where $\|u\| = 1$ and thus
$\mathcal{O}' = \eta u \cdot G = \eta \|G\| u \cdot g = \eta \|G\| \mathcal{O}$. Note that the maximum possible value of $\mathcal{O} = u \cdot g$ is one, and corresponds
to $\mathcal{O}' = \eta \|G\|$.

To avoid unnecessary mathematical complications associated with the generation of random vectors of unit length
uniformly distributed over a high-dimensional sphere, we will approximate this process by assuming in some of
the calculations that the components of u are i.i.d. Gaussian with mean 0 and variance $1/W$ (when perturbing the
weights). Equivalently for our purposes, we can alternatively assume the components of u to be i.i.d. uniform over
$[-a, a]$, with $a = \sqrt{3/W}$, so that the mean is also 0 and the variance is $1/W$. In either case, the square of the norm
$\|u\|^2$ tends to be normally distributed by the central limit theorem, with expectation 1. A simple calculation shows
that the variance of $\|u\|^2$ is given by $2/W$ in the Gaussian case, and by $4/(5W)$ in the uniform case. Thus in this case
$G \cdot u$ tends to be normally distributed with mean 0 and variance $C\|G\|^2/W$ (for some constant $C > 0$) so that
$\mathcal{O}' \approx \eta \sqrt{C} \|G\| / \sqrt{W}$.

In some calculations, we will also require all the components of u to be positive. In this case, it is easier to assume
that the components of u are i.i.d. uniform over $[0, a]$ with $a = \sqrt{3/W}$, so that $\|u\|^2$ tends to be normally distributed
by the central limit theorem, with expectation 1 and variance $4/(5W)$. Thus in this case $G \cdot u$ tends to be normally
distributed with mean $(\sqrt{3/W}/2) \sum_i G_i$ and variance $\|G\|^2/(4W)$ so that $\mathcal{O}' \approx \eta (\sqrt{3/W}/2) \sum_i G_i$.
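
The variances quoted for the squared norm of u can be checked by simulation. The sketch below is a Monte Carlo illustration under the stated i.i.d. assumptions; the helper names and the choices W = 200 and 2000 trials are arbitrary and not taken from the text.

```python
import math
import random

def gaussian_component(rng, num_weights):
    """One component of u: Gaussian with mean 0 and variance 1/W."""
    return rng.gauss(0.0, math.sqrt(1.0 / num_weights))

def uniform_component(rng, num_weights):
    """One component of u: uniform on [-a, a] with a = sqrt(3/W), so the variance is 1/W."""
    a = math.sqrt(3.0 / num_weights)
    return rng.uniform(-a, a)

def squared_norm_stats(num_weights, trials, sampler, rng):
    """Empirical mean and variance of the squared norm of u over repeated draws."""
    values = []
    for _ in range(trials):
        values.append(sum(sampler(rng, num_weights) ** 2 for _ in range(num_weights)))
    mean = sum(values) / trials
    variance = sum((v - mean) ** 2 for v in values) / (trials - 1)
    return mean, variance

if __name__ == "__main__":
    rng = random.Random(0)
    W = 200
    cases = [("gaussian", gaussian_component, 2.0 / W),
             ("uniform", uniform_component, 4.0 / (5.0 * W))]
    for name, sampler, predicted_variance in cases:
        mean, variance = squared_norm_stats(W, 2000, sampler, rng)
        print(name, mean, variance, predicted_variance)
```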

8.4 A Spectrum of Descent Algorithms 

In addition to backpropagation (BP) which is one way of implementing stochastic gradient descent, we consider 
stochastic descent algorithms associated with small perturbations of the weights or the activities. These algorithms 
can be identified by a name of the form P{W or A}{L or G}{B or R}{$\emptyset$ or K}. The perturbation (P) can be applied
to the weights (W) or, in deep targets algorithms, to the activities (A). The perturbation can be either local (L) when 
applied to a single weight or activity, or global (G) when applied to all the weights or activities. The feedback provided 
to the network can be either binary (B) indicating whether the perturbation leads to an improvement or not, or real (R) 
indicating the magnitude of the improvement. Finally, the presence of K indicates that the corresponding perturbation 
is repeated K times. For brevity, we focus on the following main cases (other cases, including intermediary cases 
between local and global where, for instance, perturbations are applied layerwise can be analyzed in similar ways and 
do not offer additional insights or improvements): 

• PWGB is the stochastic descent algorithm where all the weights are perturbed by a small amount. If the
error decreases the perturbation is accepted. If the error increases the perturbation is rejected. Alternatively the
opposite perturbation can be accepted, since it will decrease the error (in the case of a differentiable error function
and small perturbations); however, this is a detail since at best it speeds things up by a factor of two.

• PWLR is the stochastic descent algorithm where each weight in turn is perturbed by a small amount and the
feedback provided is a real number representing the change in the error. Thus this algorithm corresponds to the
computation of the derivative of the error with respect to each weight using the definition of the derivative. In
short, it corresponds also to stochastic gradient descent but provides a different mechanism for computing the
derivative. It is not a deep targets algorithm.

• PWLB is the binary version of PWLR where only one bit, whether the error increases or decreases, is transmitted
back to each weight upon its small perturbation. Thus in essence this algorithm provides the sign of each
component of the gradient, or the orthant in which the gradient is located, but not its magnitude. After cycling
once through all the weights, a random descent unit vector can be generated in the corresponding orthant (each
component $u_i$ has the sign of $g_i$).

• PALR is the deep targets version of PWLR where the activity of each unit in turn is perturbed by a small amount,
thus providing the derivative of the error with respect to each activity, which in turn can be used to compute the
derivative of the error with respect to each weight.

• PWGBK is similar to PWGB, except that K small global perturbations are produced, rather than a single one. 
In this case, the binary feedback provides information about which perturbation leads to the largest decrease in 
error. 

• PWGRK is similar to PWGBK except that for each perturbation a real number, corresponding to the change in 
the error, is fed back. This corresponds to providing the value of the dot product of the gradient with K different 
unit vector directions. 

8.5 Analysis of the Algorithms: the Optimality of Backpropagation 
8.5.1 Global Weight Perturbation with Binary Feedback (PWGB)

For each small global perturbation of the weights, this algorithm transmits a single bit back to all the weights,
corresponding to whether the error increases or decreases. This is not a deep targets algorithm. The perturbation itself
requires one forward propagation, leading to $\mathcal{I}_W = 1/W$ and $C_W = 1$. Thus:

• $\mathcal{I}_W = 1/W$

• $C_W = 1$

• $\mathcal{R} = 1/W$

• $\mathcal{O}' = \eta C \|G\| / \sqrt{W}$ for some constant $C > 0$, so that $\mathcal{O} = C/\sqrt{W}$
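
A minimal sketch of PWGB on a toy problem follows. The quadratic error, the perturbation scale, and the helper names are assumptions made for illustration; the only feedback used by the update is the single accept/reject bit described above.

```python
import random

def quadratic_error(weights, target):
    """Toy differentiable error: half the squared distance to a target weight vector."""
    return 0.5 * sum((w - t) ** 2 for w, t in zip(weights, target))

def pwgb_step(weights, error_fn, scale, rng):
    """One PWGB step: perturb every weight, keep the move only if the error decreases.

    The accept/reject decision is the single bit of feedback shared by all weights.
    (A real implementation would cache the current error instead of recomputing it.)
    """
    candidate = [w + rng.gauss(0.0, scale) for w in weights]
    accepted = error_fn(candidate) < error_fn(weights)
    return (candidate if accepted else weights), accepted

def run_pwgb(num_weights=50, steps=500, scale=0.01, seed=0):
    """Run PWGB from the origin and return the error after every step."""
    rng = random.Random(seed)
    target = [rng.uniform(-1.0, 1.0) for _ in range(num_weights)]
    weights = [0.0] * num_weights
    errors = [quadratic_error(weights, target)]
    for _ in range(steps):
        weights, _ = pwgb_step(weights, lambda w: quadratic_error(w, target), scale, rng)
        errors.append(quadratic_error(weights, target))
    return errors

if __name__ == "__main__":
    history = run_pwgb()
    print(history[0], history[-1])
```

By construction the error never increases, but progress per forward pass is slow, in line with the small rate derived above.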

8.5.2 Local Weight Perturbation with Real Feedback (PWLR) 

This is the definition of the derivative. The derivative $\partial E_{err}/\partial w_{ij}$ can also be computed directly by first perturbing
$w_{ij}$ by a small amount $\epsilon$, propagating forward, measuring $\Delta E_{err} = E_{err}(w_{ij} + \epsilon) - E_{err}(w_{ij})$ and then using
$\partial E_{err}/\partial w_{ij} \approx \Delta E_{err}/\epsilon$. This is not a deep targets algorithm. The algorithm computes the gradient and thus
propagates D bits back to each weight, at a total computational cost that scales like W per weight, since it essentially
requires one forward propagation for each weight. Thus:

• $\mathcal{I}_W = D$

• $C_W = W$

• $\mathcal{R} = D/W$

• $\mathcal{O}' = \eta \|G\|$ for a step $\eta g$, and thus $\mathcal{O} = 1$
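
The procedure amounts to a forward finite difference per weight. A minimal sketch, assuming a generic error function passed in as `error_fn` (a hypothetical name) that performs one forward propagation per call:

```python
def pwlr_gradient(weights, error_fn, epsilon=1e-6):
    """Estimate dE/dw_i for every weight by perturbing one weight at a time.

    Uses one baseline evaluation plus one forward evaluation per weight,
    so the total cost grows like W forward passes.
    """
    base = error_fn(weights)
    gradient = []
    for i in range(len(weights)):
        perturbed = list(weights)
        perturbed[i] += epsilon                      # perturb a single weight
        gradient.append((error_fn(perturbed) - base) / epsilon)
    return gradient

if __name__ == "__main__":
    target = [0.5, -1.0, 2.0]
    error = lambda w: 0.5 * sum((wi - ti) ** 2 for wi, ti in zip(w, target))
    print(pwlr_gradient([0.0, 0.0, 0.0], error))
```

Since each forward pass itself costs on the order of W operations, the W + 1 evaluations give the cost of order W per weight stated above.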

8.5.3 Local Weight Perturbation with Binary Feedback (PWLB) 

This is not a deep targets algorithm. The algorithm provides a single bit of information back to each weight and requires
a forward propagation to do so. Without any loss of generality, we can assume that all the components $u_i$ of the final
random descent vector must be positive. Thus:

• $\mathcal{I}_W = 1$

• $C_W = W$

• $\mathcal{R} = 1/W$

• $\mathcal{O}' = \eta (\sqrt{3/W}/2) \sum_i G_i$, and thus $\mathcal{O} = (\sqrt{3/W}/2) \sum_i g_i$

8.5.4 Local Activity Perturbation with Real Feedback (PALR) 

This is a deep targets algorithm and from the computation of the derivative with respect to the activity of each unit,
one can derive the gradient. So it provides D bits of feedback to each unit, as well as to each weight. The algorithm
requires in total N forward propagations, one for each unit, resulting in a total computational cost of NW or N per
weight. Thus:

• $\mathcal{I}_W = \mathcal{I}_N = D$

• $C_W = N$

• $\mathcal{R} = D/N$

• $\mathcal{O}' = \eta \|G\|$ for a step $\eta g$, and thus $\mathcal{O} = 1$

8.5.5 Global Weight Perturbation with Binary Feedback K Times (PWGBK)


In this version of the algorithm, the information backpropagated is which of the K perturbations leads to the best
improvement, corresponding to the same log K bits for all the weights. The total cost is K forward propagations. Note
that the K perturbations constrain the gradient to be in the intersection of K hyperplanes, and this corresponds to more
information than retaining only the best perturbation. However the gain is small enough that a more refined version of
the algorithm and the corresponding calculations are not worth the effort. Thus here we just use the best perturbation.
We have seen that for each perturbation the dot product of the corresponding unit vector u with G is essentially
normally distributed with mean 0 and variance $C\|G\|^2/W$. The maximum of K samples of a normal distribution
(or the absolute value of the samples if random ascending directions are inverted into descending directions) follows
an extreme value distribution [..., 23] and the average of the maximum will scale like the standard deviation times a
factor $\sqrt{\log K}$ up to a multiplicative constant. Thus:

• $\mathcal{I}_W = \log K / W$

• $C_W = K$

• $\mathcal{R} = (\log K / W)/K$

• $\mathcal{O}' = \eta C \|G\| \sqrt{\log K} / \sqrt{W}$ for some constant $C > 0$, and thus $\mathcal{O} = C \sqrt{\log K} / \sqrt{W}$
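
The slow growth of the best-of-K improvement can be observed by simulation. The sketch below is illustrative only: it draws directions uniformly on the unit sphere, flips ascending directions, and averages the best normalized improvement; the function names and sample sizes are assumptions.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Draw a direction uniformly on the unit sphere by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def best_of_k_improvement(g, k, rng):
    """Largest |u . g| among k random unit directions (ascending directions are flipped)."""
    best = 0.0
    for _ in range(k):
        u = random_unit_vector(len(g), rng)
        best = max(best, abs(sum(ui * gi for ui, gi in zip(u, g))))
    return best

def mean_best(dim, k, trials, rng):
    """Average best-of-k normalized improvement for a fixed random unit gradient g."""
    g = random_unit_vector(dim, rng)
    return sum(best_of_k_improvement(g, k, rng) for _ in range(trials)) / trials

if __name__ == "__main__":
    rng = random.Random(0)
    for k in (1, 4, 16, 64):
        print(k, mean_best(dim=100, k=k, trials=50, rng=rng))
```

Multiplying K by sixteen increases the average improvement only modestly, while the cost grows linearly in K.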

8.5.6 Global Weight Perturbation with Real Feedback K Times (PWGRK) 

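This algorithm provides KD bits of feedback in total, or KD/W per weight, and requires K forward propagations. In
terms of improvements, let us consider that the algorithm generates K random unit vector directions $u^{(1)}, \ldots, u^{(K)}$
and produces the K dot products $u^{(1)} \cdot G, \ldots, u^{(K)} \cdot G$. In high dimensions (W large), the K random directions are
approximately orthogonal. As a result of this information, one can select the unit descent direction given by

\[ u = \frac{\sum_{k=1}^{K} (u^{(k)} \cdot G) \, u^{(k)}}{\left\| \sum_{k=1}^{K} (u^{(k)} \cdot G) \, u^{(k)} \right\|} \]

Now we have

\[ \left\| \sum_{k=1}^{K} (u^{(k)} \cdot G) \, u^{(k)} \right\|^2 \approx \sum_{k=1}^{K} (u^{(k)} \cdot G)^2 \approx \frac{C K \|G\|^2}{W} \]

for some constant $C > 0$. The first approximation is because the vectors $u^{(k)}$ are roughly orthogonal, and the second
is simply by taking the expectation. As a result, $\mathcal{O}' = \eta u \cdot G = \eta C \sqrt{K} \|G\| / \sqrt{W}$ for some constant $C > 0$. Thus:

• $\mathcal{I}_W = KD/W$

• $C_W = K$

• $\mathcal{R} = D/W$

• $\mathcal{O}' = \eta C \sqrt{K} \|G\| / \sqrt{W}$, and thus $\mathcal{O} = C \sqrt{K} / \sqrt{W}$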
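
The combination of K real-valued feedbacks can be sketched as follows. This is an illustration under the assumption that directions are drawn uniformly on the unit sphere; `pwgrk_direction` and `mean_improvement` are hypothetical names, and the sizes used are arbitrary.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Draw a direction uniformly on the unit sphere by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def pwgrk_direction(gradient, k, rng):
    """Combine k random unit directions, each weighted by its real feedback u_k . G."""
    dim = len(gradient)
    combined = [0.0] * dim
    for _ in range(k):
        u = random_unit_vector(dim, rng)
        feedback = sum(ui * gi for ui, gi in zip(u, gradient))   # one real number per probe
        for i in range(dim):
            combined[i] += feedback * u[i]
    norm = math.sqrt(sum(c * c for c in combined))
    return [c / norm for c in combined]

def mean_improvement(dim, k, trials, rng):
    """Average normalized improvement u . g of the combined direction over several trials."""
    total = 0.0
    for _ in range(trials):
        g = random_unit_vector(dim, rng)
        u = pwgrk_direction(g, k, rng)
        total += sum(ui * gi for ui, gi in zip(u, g))
    return total / trials

if __name__ == "__main__":
    rng = random.Random(0)
    for k in (20, 80):
        print(k, mean_improvement(dim=200, k=k, trials=20, rng=rng), math.sqrt(k / 200))
```

When K is small relative to W, the measured improvement stays close to the square root of K/W, consistent with the scaling above.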


8.5.7 Backpropagation (BP) 

As we have already seen: 


• $\mathcal{I}_W = \mathcal{I}_N = D$

• $C_W = 1$

• $\mathcal{R} = D$

• $\mathcal{O}' = \eta \|G\|$ for a step $\eta g$, and thus $\mathcal{O} = 1$

Algorithm   Information $\mathcal{I}_W$   Computation $C_W$   Rate $\mathcal{R}$   Improvement $\mathcal{O}$
PWGB        $1/W$                         $1$                 $1/W$                $C/\sqrt{W}$
PWLR        $D$                           $W$                 $D/W$                $1$
PWLB        $1$                           $W$                 $1/W$                $(\sqrt{3/W}/2) \sum_i g_i$
PALR        $D$                           $N$                 $D/N$                $1$
PWGBK       $\log K / W$                  $K$                 $(\log K / W)/K$     $C \sqrt{\log K} / \sqrt{W}$
PWGRK       $KD/W$                        $K$                 $D/W$                $C \sqrt{K} / \sqrt{W}$
BP          $D$                           $1$                 $D$                  $1$

Table 8: The rate $\mathcal{R}$ and improvement $\mathcal{O}$ of several optimization algorithms.
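
The rate column of Table 8 can be tabulated for concrete network sizes. In the sketch below the function name `channel_table`, the example sizes, and the use of a base-2 logarithm for the log K bits are assumptions made for illustration.

```python
import math

def channel_table(W, N, K, D):
    """Return {algorithm: (information per weight, computation per weight, rate)}.

    W: number of weights, N: number of units, K: number of perturbations,
    D: bits per real number. Entries follow Table 8; log K is taken in base 2.
    """
    entries = {
        "PWGB": (1.0 / W, 1.0),
        "PWLR": (float(D), float(W)),
        "PWLB": (1.0, float(W)),
        "PALR": (float(D), float(N)),
        "PWGBK": (math.log2(K) / W, float(K)),
        "PWGRK": (K * D / W, float(K)),
        "BP": (float(D), 1.0),
    }
    return {name: (info, cost, info / cost) for name, (info, cost) in entries.items()}

if __name__ == "__main__":
    table = channel_table(W=10**6, N=10**4, K=100, D=64)
    for name, (info, cost, rate) in sorted(table.items(), key=lambda item: -item[1][2]):
        print(f"{name:6s} info={info:.3g} cost={cost:.3g} rate={rate:.3g}")
```

With these sizes the rates span many orders of magnitude, and only backpropagation reaches the bound D.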


The reason that no algorithm better than backpropagation has been found is that the rate $\mathcal{R}$ of backpropagation
is greater than or equal to that of all the alternatives considered here (Table 8). This is true also for the improvement
$\mathcal{O}$. Furthermore, there is no close second: all the other algorithms discussed in this section fall considerably behind
backpropagation in at least one dimension. And finally, it is unlikely that an algorithm exists with a rate or improvement
higher than backpropagation, because backpropagation achieves both the maximal possible rate, and maximal
possible improvement (Figure 8.1), up to multiplicative constants. Thus in conclusion we have the following theorem:

Theorem: The rate $\mathcal{R}$ of backpropagation is above or equal to the rate of all the other algorithms described here
and it achieves the maximum possible value $\mathcal{R} = D$. The improvement $\mathcal{O}$ of backpropagation is above or equal to
the improvement of all the other algorithms described here and it achieves the maximum possible value $\mathcal{O} = 1$ (or
$\mathcal{O}' = \eta \|G\|$).


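[Figure 8.1 plot: improvement $\mathcal{O}$ (bounded by 1) against rate $\mathcal{R}$ (bounded by D). BP sits at the corner where both
bounds are attained, all other algorithms (PWLR, etc.) lie inside the feasible region, and the regions beyond the two
bounds are marked non feasible.]

Figure 8.1: Backpropagation is optimal in the space of possible learning algorithms, achieving both the maximal possible rate
$\mathcal{R} = D$ and maximal expected improvement $\mathcal{O} = 1$.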


8.6 Recurrent Networks 


Remarkably, the results of the previous sections can be extended to recurrent, as well as recursive, networks (see [57]
for an attempt at implementing backpropagation in recurrent networks using Hebbian learning). To see this, consider
a recurrent network with W connection weights, where the connections can form directed cycles. If the network is
unfolded in time over L time steps, one obtains a deep feedforward network (Figure 8.2), where the same set of
original weights from the recurrent network is used to update all the unit activations, from one layer (or time step)
to the next. Thus the unfolded version has a set of W weights that are shared L times. In the recurrent case, one
may have targets for all the units at all the time steps, or more often, targets may be available only at some time
steps, and possibly only for some of the units. Regardless of the pattern of available targets, the same argument used
in Section 6.4 to expose the limitations of deep local learning exposes the limitations of local learning in recurrent
networks. More precisely, under the assumption that the error function is a differentiable function of the weights, any
algorithm capable of reaching an optimal set of weights (where all the partial derivatives are zero) must be capable of
“backpropagating” the information provided by any target at any time step to all the weights capable of influencing
the corresponding activation. This is because a target $T_i^l$ for unit i at time l will appear in the partial derivative of any
weight present in the recurrent network that is capable of influencing the activity of unit i in l steps or fewer. Thus,
in general, an implementation capable of reaching an optimal set of weights must have a “channel” in the unfolded
network capable of transmitting information from $T_i^l$ back to all the weights in all the layers up to l that can influence
the activity of unit i in layer l. Again in a large recurrent network the maximal amount of information that can be sent
back is the full gradient and the minimal number of operations required typically scales like WL. Thus this shows
that the backpropagation through time algorithm is optimal in the sense of providing the most information, i.e. the
full gradient, for the least number of computations (WL).
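
The unfolding argument can be illustrated with a small sketch of backpropagation through time. The setting is an assumption chosen for brevity: a fully connected tanh network (self-connections included, unlike the example of Figure 8.2), a squared error, and a single target on unit 0 at the last time step. The gradient of each shared weight is accumulated over its L unfolded copies, at a cost of order WL.

```python
import math

def forward(weights, x, steps):
    """Unfold the recurrent network: return the activities of every layer (time step)."""
    layers = [list(x)]
    for _ in range(steps):
        prev = layers[-1]
        layers.append([math.tanh(sum(w * o for w, o in zip(row, prev))) for row in weights])
    return layers

def error(weights, x, steps, target):
    """Squared error on unit 0 at the final time step."""
    return 0.5 * (forward(weights, x, steps)[-1][0] - target) ** 2

def bptt_gradient(weights, x, steps, target):
    """Gradient of the error w.r.t. the shared weights, summed over the L unfolded copies."""
    n = len(weights)
    layers = forward(weights, x, steps)
    grad = [[0.0] * n for _ in range(n)]
    out = layers[-1]
    delta = [0.0] * n
    delta[0] = (out[0] - target) * (1.0 - out[0] ** 2)    # target only on unit 0, last step
    for t in range(steps, 0, -1):
        prev = layers[t - 1]
        for i in range(n):
            for j in range(n):
                grad[i][j] += delta[i] * prev[j]           # copy of w_ij used at step t
        if t > 1:
            # Send the error signal one layer (time step) back through the same weights.
            delta = [sum(weights[i][j] * delta[i] for i in range(n)) * (1.0 - prev[j] ** 2)
                     for j in range(n)]
    return grad

if __name__ == "__main__":
    w = [[0.5, -0.3, 0.8], [0.1, 0.4, -0.6], [-0.7, 0.2, 0.3]]
    print(bptt_gradient(w, [0.2, -0.5, 0.9], steps=4, target=0.7))
```

A single target at the last step produces nonzero contributions for copies of the weights at every earlier step, which is the backward channel described above.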

Boltzmann machines [...], which can be viewed as a particular class of recurrent networks with symmetric connections,
can have hidden nodes and thus be considered deep. Although their main learning algorithm can be viewed
as a form of simple Hebbian learning ($\Delta w_{ij} \propto \langle O_i O_j \rangle_{clamped} - \langle O_i O_j \rangle_{free}$), they are no exception to the
previous analyses. This is because the connections of a Boltzmann machine provide a channel allowing information
about the targets obtained at the visible units to propagate back towards the deep units. Furthermore, it is well known
that this learning rule precisely implements gradient descent with respect to the relative divergence between the true
and observed distributions of the data, measured at the visible units. Thus the Hebbian learning rule for Boltzmann
machines implements a form of local deep learning which in principle is capable of transmitting the maximal amount
of information, from the visible units to the deep units, equal to the gradient of the error function. What is perhaps
less clear is the computational cost and how it scales with the total number of weights W, since the learning rule in
principle requires the achievement of equilibrium distributions.

Finally, the nature of the learning channel, and its temporal dynamics, in physical recurrent networks, including 
biological neural networks, are important but beyond the scope of this paper. However, the analysis provided is already 
useful in clarifying that the backward recurrent connections could serve at least three different roles: (1) a fast role to 
dynamically combine bottom-up and top-down activity, for instance during sensory processing; (2) a slower role to 
help carry signals for learning the feedforward connections; and (3) a slower role to help carry signals for learning the 
backward connections. 


9 Conclusion 

The concept of Hebbian learning has played an important role in computational neuroscience, neural networks, and 
machine learning for over six decades. However, the vagueness of the concept has hampered systematic investigations
and overall progress. To redress this situation, it is beneficial to expose two separate notions: the locality of
learning rules and their functional form. Learning rules can be viewed as mathematical expressions for computing the 
adjustment of variables describing synapses during learning, as a function of variables which, in a physical system, 
must be local. Within this framework, we have studied the space of polynomial learning rules in linear and non-linear 
feedforward neural networks. In many cases, the behavior of these rules can be estimated analytically and reveals how 
these rules are capable of extracting relevant statistical information from the data. However, in general, deep local 
learning associated with the stacking of local learning rules in deep feedforward networks is not sufficient to learn 
complex input-output functions, even when targets are available for the top layer. 

Learning complex input-output functions requires a learning channel capable of propagating information about the 
targets to the deep weights and resulting in local deep learning. In a physical implementation, this learning channel 
can use either the forward connections in the reverse direction, or a separate set of connections. Furthermore, for 
large networks, all the information carried by the feedback channel can be interpreted in terms of the number of bits 
of information about the gradient provided to each weight. The capacity of the feedback channel can be defined in 
terms of the number of bits provided about the gradient per weight, divided by the number of required operations per 
weight. The capacity of many possible algorithms can be calculated, and the calculations show that backpropagation 
outperforms all other algorithms as it achieves the maximum possible capacity. This is true in both feedforward and 
recurrent networks. It must be noted, however, that these results are obtained using somewhat rough estimates (up
to multiplicative constants) and there may be other interesting algorithms that scale similarly to backpropagation. In
particular, we are investigating the use of random, as opposed to symmetric, weights in the learning channel, which
seems to work in practice [...].

Figure 8.2: Left: Recurrent neural network with three neurons and W = 6 connection weights. Right: The same network
unfolded through time for L steps, producing a deep feedforward network where the weights are shared between time steps.
Each original weight is replicated L times. Targets are available for a subset of the layers (i.e. time steps). In order to reach an
optimal set of weights, a learning algorithm must allow each individual target to influence all the copies of all the weights leading
to the corresponding unit.

The remarkable optimality of backpropagation suggests that when deploying learning systems in computer environments
with specific constraints and budgets (in terms of big data, local storage, parallel implementations,
communication bandwidth, etc.) backpropagation provides the upper bound of what can be achieved, and the model that
should be emulated or approximated by other systems.

Likewise, the optimality of backpropagation leads one to wonder also whether, by necessity, biological neural
systems must have discovered some form of stochastic gradient descent during the course of evolution. While the
question of whether the results presented here have some biological relevance is interesting, several other points must
be taken into consideration. First, the analyses have been carried out in the simplified supervised learning setting, which
is not meant to closely match how biological systems learn. Whether the supervised learning setting can approximate
at least some essential aspects of biological learning is an open question, and so is the related question of extending
the theory of local learning to other forms of learning, such as reinforcement learning [52].

Second, the analyses have been carried out using artificial neural network models. Again the question of whether
these networks capture some essential properties of biological networks is not settled. Obviously biological neurons
are very complex biophysical information processing machines, far more complex than the neurons used here. On
the other hand, there are several examples in the literature (see, for instance, [..., 45, ...]) where important
biological properties seem to be captured by artificial neural network models. [In fact these results, taken together
with the sometimes superhuman performance of backpropagation and the optimality results presented here, lead us
to conjecture paradoxically that biological neurons may be trying to approximate artificial neurons, and not the other
way around, as has been assumed for decades.] But even if they were substantially unrelated to biology, artificial
neural networks still provide the best simple model we have of a connectionist style of computation and information
storage, entirely different from the style of digital computers, where information is both scattered and superimposed
across synapses and intertwined with processing, rather than stored at specific memory addresses and segregated from
processing.

In any case, for realistic biological modeling, the complex geometry of neurons and their dendritic trees must be
taken into consideration. For instance, there is a significant gap between having a feedback error signal Bi arrive at
the soma of neuron i, and having Bi available as a local variable in a far away synapse located in the dendritic tree
of neuron i. In other words, Bi must become a local variable at the synapse Wij. Using the same factor of $10^6$ from
the Introduction, which rescales a synapse to the size of a fist, this gap could correspond to tens or even hundreds of
meters. Furthermore, in a biological or other physical system, one must worry about locality not only in space, but
also in time, e.g. how close must Bi and Oj be in time?

Third, issues of coordination of learning across different brain components and regions must also be taken into
consideration (e.g. [53]). And finally, a more complete model of biological learning would have to include not only
target signals that are backpropagated electrically, but ultimately also the complex and slower biochemical processes
involved in synaptic modification, including gene expression and epigenetic modifications, and the complex production,
transport, sequestration, and degradation of protein, RNA, and other molecular species (e.g. [..., 54]).

However, while there is no definitive evidence in favor of or against the use of stochastic gradient descent in biological
neural systems, and obtaining such evidence remains a challenge, biological deep learning must follow the locality
principle and thus the theory of local learning provides a framework for investigating this fundamental question.


Appendix A: Uniqueness of Simple Hebb in Hopfield Networks 

A Hopfield model can be viewed as a network of $N$ $[-1,1]$ threshold gates connected symmetrically ($w_{ij} = w_{ji}$) with 
no self-connections ($w_{ii} = 0$). As a result the network has a quadratic energy function $E = -\frac{1}{2}\sum_{i,j} w_{ij}O_iO_j$ and 
the dynamics of the network under stochastic asynchronous updates converges to local minima of the energy function. 
Given a set $S$ of $M$ memory vectors $M^{(1)}, \ldots, M^{(M)}$, the simple Hebb rule is used to produce an energy 
function $E_S$ to try to store these memories as local minima of the energy function, so that $w_{ij} = \sum_k M_i^{(k)}M_j^{(k)}$. $E_S$ 
induces an acyclic orientation $\mathcal{O}(S)$ of the $N$-dimensional hypercube $\mathcal{H}$. If $h$ is an isometry of $\mathcal{H}$ for the Hamming 
distance, then for the simple Hebb rule we have $\mathcal{O}(h(S)) = h(\mathcal{O}(S))$. Are there any other learning rules with the 
same property? 

We consider here learning rules with $d = 0$. Thus we must have $\Delta w_{ij} = F(O_i, O_j)$ where $F$ is a polynomial 
function. On the $[-1,1]$ hypercube, we have $O_i^2 = O_j^2 = 1$ and thus we only need to consider the case $n = 2$ with 
$F(O_i, O_j) = \alpha O_iO_j + \beta O_i + \gamma O_j + \delta$. However, the learning rule must be symmetric in $i$ and $j$ to preserve the 
symmetry $w_{ij} = w_{ji}$. Therefore $F$ can only have the form $F(O_i, O_j) = \alpha O_iO_j + \beta(O_i + O_j) + \gamma$. Finally, the 
isometric invariance must be true for any set of memories $S$. It is easy to construct examples, with specific sets $S$, that 
force $\beta$ and $\gamma$ to be 0. Thus in this sense the simple Hebb rule $\Delta w_{ij} = \alpha O_iO_j$ is the only isometric invariant learning 
rule for the Hopfield model. Similar results can be derived for spin models with higher-order interactions where the 
energy function is a polynomial of degree $n > 2$ in the spin variables jsl] . 
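
The invariance property can be checked numerically on small hypercubes. The following Python sketch is only an illustration and is not part of the original analysis: it restricts the isometries to coordinate sign flips, represents the orientation by the sign of the energy change along every edge of the hypercube, and uses hypothetical helper names (`weights`, `energy`, `orientation`, `commutes_with_flip`). Under these assumptions the simple Hebb rule commutes with the flip, whereas adding a constant term $\gamma$ to the rule breaks the property for the single memory $(1,1,1)$.

```python
import itertools
import numpy as np

def weights(memories, alpha=1.0, beta=0.0, gamma=0.0):
    """Accumulate F(O_i, O_j) = alpha*O_i*O_j + beta*(O_i + O_j) + gamma over all memories."""
    mem = np.asarray(memories, dtype=float)
    n = mem.shape[1]
    w = np.zeros((n, n))
    for v in mem:
        w += alpha * np.outer(v, v) + beta * (v[:, None] + v[None, :]) + gamma
    np.fill_diagonal(w, 0.0)  # no self-connections
    return w

def energy(w, x):
    """Quadratic Hopfield energy E = -(1/2) * sum_ij w_ij x_i x_j."""
    return -0.5 * x @ w @ x

def orientation(w):
    """For each vertex, record the sign of the energy change when coordinate i is flipped."""
    n = w.shape[0]
    signs = {}
    for vertex in itertools.product([-1, 1], repeat=n):
        x = np.array(vertex, dtype=float)
        e_x = energy(w, x)
        row = []
        for i in range(n):
            y = x.copy()
            y[i] = -y[i]
            row.append(int(np.sign(energy(w, y) - e_x)))
        signs[vertex] = tuple(row)
    return signs

def commutes_with_flip(memories, s, **rule):
    """Check O(h(S)) = h(O(S)) for the sign-flip isometry h(x) = s * x (assumption: flips only)."""
    mem = np.asarray(memories, dtype=float)
    s = np.asarray(s, dtype=int)
    o_s = orientation(weights(mem, **rule))
    o_hs = orientation(weights(mem * s, **rule))
    return all(
        o_hs[tuple(int(v) for v in s * np.array(vertex))] == row
        for vertex, row in o_s.items()
    )

hebb_ok = commutes_with_flip([[1, 1, 1]], [-1, 1, 1])               # simple Hebb rule
gamma_ok = commutes_with_flip([[1, 1, 1]], [-1, 1, 1], gamma=10.0)  # Hebb plus constant term
print(hebb_ok, gamma_ok)
```

Coordinate permutations, the other generators of the hypercube isometry group, are omitted here to keep the index bookkeeping short; they could be added by permuting the columns of the memory matrix.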


Appendix B: Invariance of the Gradient Descent Rule 

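In the $[0,1]$ case with the logistic transfer function, the goal is to minimize the relative entropy error 

$$E = -\left[T\log O + (1-T)\log(1-O)\right] \tag{74}$$

Therefore 

$$\frac{\partial E}{\partial O} = -\frac{T-O}{O(1-O)} \quad\text{and}\quad \frac{\partial O}{\partial S} = O(1-O) \quad\text{and thus}\quad \Delta w_i \propto (T-O)I_i \tag{75}$$

In the $[-1,1]$ case with the tanh transfer function, the equivalent goal is to minimize 

$$E' = -\left[\frac{T'+1}{2}\log\frac{O'+1}{2} + \frac{1-T'}{2}\log\frac{1-O'}{2}\right] \tag{76}$$

where $T = \frac{T'+1}{2}$ and $O = \frac{O'+1}{2}$. Therefore 

$$\frac{\partial E'}{\partial O'} = -\frac{T'-O'}{1-O'^2} \quad\text{and}\quad \frac{\partial O'}{\partial S} = 1-O'^2 \quad\text{and thus}\quad \Delta w_i \propto (T'-O')I_i = 2(T-O)I_i \tag{77}$$
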
Thus the gradient descent learning rule is the same, up to a factor of 2 which can be absorbed by the learning rate. The 
origin of this factor lies in the fact that $\tanh(x) = (1 - e^{-2x})/(1 + e^{-2x})$ is actually not the natural $[-1,1]$ equivalent 
of the logistic function $\sigma(x)$. The natural equivalent is 

$$\tanh\frac{x}{2} = 2\sigma(x) - 1 = \frac{1 - e^{-x}}{1 + e^{-x}} \tag{78}$$
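
The gradients in this appendix can be sanity-checked with finite differences. The short Python sketch below is an illustration under stated assumptions: the values of `s` and `t`, the step size `h`, and all function names are arbitrary choices made for the example. It compares numerical derivatives of the two relative entropy errors with respect to the activation $S$ against $-(T - O)$ and $-(T' - O')$, and checks the identity $\tanh(x/2) = 2\sigma(x) - 1$.

```python
import math

def sigma(s):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-s))

def entropy_error(t, o):
    """Relative entropy error for targets and outputs in [0, 1]."""
    return -(t * math.log(o) + (1 - t) * math.log(1 - o))

def entropy_error_pm(t_pm, o_pm):
    """Same error for targets and outputs in [-1, 1], using T = (T'+1)/2 and O = (O'+1)/2."""
    return entropy_error((t_pm + 1) / 2, (o_pm + 1) / 2)

def num_grad(f, s, h=1e-6):
    """Central finite-difference derivative of f at s."""
    return (f(s + h) - f(s - h)) / (2 * h)

s, t = 0.7, 0.8        # arbitrary activation and [0, 1] target (assumed values)
t_pm = 2 * t - 1       # the same target rescaled to [-1, 1]

g01 = num_grad(lambda u: entropy_error(t, sigma(u)), s)
gpm = num_grad(lambda u: entropy_error_pm(t_pm, math.tanh(u)), s)

print(g01, -(t - sigma(s)))                      # logistic case: dE/dS = -(T - O)
print(gpm, -(t_pm - math.tanh(s)))               # tanh case: dE'/dS = -(T' - O')
print(math.tanh(s / 2), 2 * sigma(s) - 1)        # natural [-1, 1] equivalent of sigma
```

Since $(\tanh(S) + 1)/2 = \sigma(2S)$, the tanh gradient also equals $-2(T - \sigma(2S))$, which is one way to see where the factor of 2 comes from.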

Appendix C: List of New Convergent Learning Rules 

All the rules are based on adding a simple decay term to the simple Hebb rule and its supervised variants. 


Fixed Decay 

$$\Delta w_{ij} \propto O_iO_j - Cw_{ij} \quad\text{with } C > 0 \tag{79}$$

with the supervised clamped version 

$$\Delta w_{ij} \propto T_iO_j - Cw_{ij} \tag{80}$$

and the gradient descent version 

$$\Delta w_{ij} \propto (T_i - O_i)O_j - Cw_{ij} \tag{81}$$

Adaptive Decay Depending on the Presynaptic Term 

$$\Delta w_{ij} \propto O_iO_j - O_j^2w_{ij} \tag{82}$$

with the supervised clamped version 

$$\Delta w_{ij} \propto T_iO_j - O_j^2w_{ij} \tag{83}$$

and the gradient descent version 

$$\Delta w_{ij} \propto (T_i - O_i)O_j - O_j^2w_{ij} \tag{84}$$

Adaptive Decay Depending on the Postsynaptic Term 

$$\Delta w_{ij} \propto O_iO_j - O_i^2w_{ij} \tag{85}$$

This is Oja's rule, which yields the supervised clamped versions 

$$\Delta w_{ij} \propto T_iO_j - O_i^2w_{ij} \quad\text{and}\quad \Delta w_{ij} \propto T_iO_j - T_i^2w_{ij} \tag{86}$$

and the gradient descent versions 

$$\Delta w_{ij} \propto (T_i - O_i)O_j - O_i^2w_{ij} \quad\text{and}\quad \Delta w_{ij} \propto (T_i - O_i)O_j - (T_i - O_i)^2w_{ij} \tag{87}$$

Adaptive Decay Depending on the Pre- and Post-Synaptic (Simple Hebb) Terms 

$$\Delta w_{ij} \propto O_iO_j - (O_iO_j)^2w_{ij} = O_iO_j(1 - O_iO_jw_{ij}) \tag{88}$$

with the clamped versions 

$$\Delta w_{ij} \propto T_iO_j - (O_iO_j)^2w_{ij} \quad\text{and}\quad \Delta w_{ij} \propto T_iO_j - (T_iO_j)^2w_{ij} \tag{89}$$

and gradient descent versions 

$$\Delta w_{ij} \propto (T_i - O_i)O_j - (O_iO_j)^2w_{ij} \quad\text{and}\quad \Delta w_{ij} \propto (T_i - O_i)O_j - \left((T_i - O_i)O_j\right)^2w_{ij} \tag{90}$$

We now consider the alternative approach which bounds the weights in a $[-C, C]$ range for some $C > 0$. The 
initial values of the weights are assumed to be small or 0. 

Bounded Weights 

$$\Delta w_{ij} \propto O_iO_j(C - w_{ij}^2) \tag{91}$$

with the clamped version 

$$\Delta w_{ij} \propto T_iO_j(C - w_{ij}^2) \tag{92}$$

and the gradient descent version 

$$\Delta w_{ij} \propto (T_i - O_i)O_j(C - w_{ij}^2) \tag{93}$$
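
To illustrate how a decay term makes the weight converge, the following Python sketch iterates the batch-averaged clamped rules (80), (83) and the second rule in (89) for a single synapse. It is a sketch under assumptions that are not in the text: the data are synthetic, the update is averaged over the whole sample rather than applied online, and the names `run_rule`, `eta`, and `steps` are hypothetical. In the clamped case neither $T_i$ nor $O_j$ depends on $w_{ij}$, so the averaged update is linear in $w_{ij}$ and its fixed point is the mean Hebbian term divided by the mean decay coefficient.

```python
import numpy as np

def run_rule(decay, t, o, eta=0.05, steps=2000, w0=0.0):
    """Iterate w <- w + eta * mean(t*o - decay(t, o) * w), starting from w0."""
    w = w0
    for _ in range(steps):
        w += eta * np.mean(t * o - decay(t, o) * w)
    return w

rng = np.random.default_rng(0)
o = rng.uniform(-1.0, 1.0, size=500)                # presynaptic activities O_j (synthetic)
t = np.sign(o + 0.3 * rng.standard_normal(500))     # clamped targets T_i, correlated with O_j

C = 2.0
w_fixed = run_rule(lambda t, o: C, t, o)             # fixed decay, rule (80)
w_pre = run_rule(lambda t, o: o ** 2, t, o)          # presynaptic decay, rule (83)
w_hebb = run_rule(lambda t, o: (t * o) ** 2, t, o)   # Hebbian-term decay, second rule in (89)

print(w_fixed, np.mean(t * o) / C)
print(w_pre, np.mean(t * o) / np.mean(o ** 2))
print(w_hebb, np.mean(t * o) / np.mean((t * o) ** 2))
```

Each printed pair compares the iterated weight with the ratio predicted by setting the averaged update to zero. The unclamped and gradient descent versions are not covered by this argument, because there $O_i$ itself depends on the weights.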


Appendix D: Additional Remarks on Deep Targets Algorithms 

1) In many situations, for a given input vector $I$ there will be a corresponding and distinct activity vector $O^{h-1}$. 
However, sometimes the function $C_{h-1}$ may not be injective, in which case several input vectors $I(t_1), \ldots, I(t_k)$, 
with final targets $T(t_1), \ldots, T(t_k)$, may get mapped onto the same activity vector $O^{h-1} = C_{h-1}(I(t_1)) = \ldots = C_{h-1}(I(t_k))$. 
In this case, the procedure for determining the target vector may need to be adjusted slightly as 
follows. First, the sample of activity is generated and propagated forward using the function $A_{h+1}$, as in the 
injective case. However, the selection of the best output vector over the sample may take into consideration all 
the targets $T(t_1), \ldots, T(t_k)$ rather than the isolated target associated with the current input example. For instance, the 
best output vector may be chosen so as to minimize the sum of the errors with respect to all these targets. This procedure 
is the generalization of the procedure used to train an unrestricted Boolean autoencoder |0]. 

2) Depending on the schedule in the outer loop, the sampling approach, and the optimization algorithm used in the 
inner loop, as well as other implementation details, the description above provides a family of algorithms, rather than 
a single algorithm. Examples of schedules for the outer loop include a single pass from layer 1 to layer L, alternating 
up and down passes along the architecture, cycling through the layers in the order 1,2,1,2,3,1,2,3,4, etc., and their 
variations. 

3) The sampling deep targets approach can be combined with all the other “tricks” of backpropagation such as weight 
sharing and convolutional architectures, momentum, dropout, and so forth. Adjustable learning rates can be used with 
different adjustment rules for different learning phases 115, 1^ . 

4) The sampling deep targets approach can also be easily combined with backpropagation. For instance, targets can 
be provided for every other layer, rather than for every layer, and backpropagation used to train pairs of adjacent 
layers. It is also possible to interleave the layers over which backpropagation is applied to better stitch the shallow 
components together (e.g. use backpropagation for layers 3,2,1 then 4,3,2, etc.). 

5) When sampling from a layer, here we have focused on using the optimal output sample to derive the target. It may 
be possible instead to leverage additional information contained in the entire distribution of samples. 

6) In practice the algorithm converges, at least to a local minimum of the error function. In general the convergence 
is not monotonic (Figure 7.4), with occasional uphill jumps that can be beneficial in avoiding poor local minima. 
Convergence can be proved mathematically in several cases. For instance, if the optimization procedure can map each 
hidden activity to each corresponding target over the entire training set, then the overall training error is guaranteed 
to decrease or stay constant at each optimization step and hence it will converge to a stable value. In the unrestricted 
Boolean case (or in the Boolean case with perfect optimization), with exhaustive sampling of each hidden layer the 
algorithm can also be shown to be convergent. Finally, it can also be shown to be convergent in the framework of 
stochastic learning and stochastic component optimization. 


7) A different kind of deep targets algorithm, where the output targets are used as targets for all the hidden layers, is 
described in il9t] . The goal in this case is to force successive hidden layers to refine their predictions towards the final 
target. 
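
The target selection step described in remark 1 can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function name `select_deep_target`, the callable `forward` (standing in for the forward map from the sampled layer to the output), and the use of squared error as the error measure are all assumptions made for the example.

```python
import numpy as np

def select_deep_target(samples, forward, targets):
    """Pick the sampled activity whose propagated output has the smallest summed
    squared error over all targets that share the same hidden activity."""
    targets = np.atleast_2d(np.asarray(targets, dtype=float))
    best, best_err = None, np.inf
    for s in samples:
        out = forward(s)                          # propagate the sample to the output layer
        err = float(np.sum((targets - out) ** 2)) # sum of errors over all grouped targets
        if err < best_err:
            best, best_err = s, err
    return best, best_err

# Toy forward map (hypothetical): the output is the sum of the sampled activities.
forward = lambda s: np.array([s.sum()], dtype=float)
samples = [np.array([0, 0]), np.array([0, 1]), np.array([1, 1])]

# Single target, as in the injective case.
single_best, single_err = select_deep_target(samples, forward, [[1]])

# Several targets mapped onto the same activity, as in the non-injective case.
group_best, group_err = select_deep_target(samples, forward, [[1], [2], [2]])
print(single_best, single_err, group_best, group_err)
```

With the single target the sample `[0, 1]` is selected, while taking all three targets into account shifts the choice to `[1, 1]`, which is the adjustment the remark describes.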